To collect terabytes of data per day, such systems typically need the following characteristics:
- Bridge the application systems and the analysis systems, decoupling the two from each other;
- Support both near-real-time online analysis systems and offline analysis systems such as Hadoop;
- Offer high scalability, i.e. when the volume of data grows, the system can be scaled out horizontally by adding nodes.
The following describes each open-source component in terms of design architecture, load balancing, scalability, and fault tolerance.
Facebook's Scribe
Scribe is Facebook's open-source log collection system and has been used extensively inside Facebook. It collects logs from a variety of log sources and stores them on a central storage system (which can be NFS, a distributed file system, etc.) for centralized statistical analysis. It provides a scalable, highly fault-tolerant solution for the "distributed collection, unified processing" of logs.
Its most important feature is good fault tolerance: when the back-end storage system crashes, Scribe writes the data to the local disk, and once the storage system recovers, Scribe reloads the logs into it.
Architecture: the architecture of Scribe is relatively simple, consisting of three parts: the Scribe agent, the Scribe server, and the storage system.
(1) Scribe agent
The Scribe agent is actually a Thrift client. The only way to send data to Scribe is through this Thrift client: Scribe internally defines a Thrift interface, which users call to send data to the server.
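Below is a minimal client sketch in Python, assuming bindings have been generated from Facebook's scribe.thrift (the scribe module and LogEntry struct come from that generated code, not written by hand); host, port, category, and message are placeholders.

```python
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from scribe import scribe            # generated from scribe.thrift
from scribe.ttypes import LogEntry   # generated struct: category + message

def send_log(category, message, host="localhost", port=1463):
    # Scribe expects a framed binary Thrift transport on its listening port;
    # strict reading/writing is disabled because Scribe's server traditionally
    # speaks the non-strict binary protocol.
    transport = TTransport.TFramedTransport(TSocket.TSocket(host, port))
    protocol = TBinaryProtocol.TBinaryProtocol(transport, False, False)
    client = scribe.Client(protocol)

    transport.open()
    # Log() takes a list of LogEntry records; the category determines which
    # store the Scribe server routes each message to.
    result = client.Log([LogEntry(category=category, message=message)])
    transport.close()
    return result  # ResultCode.OK, or TRY_LATER if the server is overloaded

send_log("app_access", "GET /index.html 200\n")
```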
(2) Scribe server
The Scribe server receives the data sent by the Thrift clients and, according to its configuration file, routes data belonging to different topics to different stores. Scribe provides a variety of stores, such as file and HDFS, into which it can load the data.
(3) Storage System
The storage system is simply the store configured in Scribe. Scribe currently supports many kinds of stores, including:
- file (writes to local files)
- buffer (double storage: one primary store and one secondary store)
- network (another Scribe server)
- bucket (contains multiple stores; data is hashed into one of them)
- null (discards the data)
- thriftfile (writes to a Thrift TFileTransport file)
- multi (forwards the data to multiple stores).
Apache's Chukwa
Chukwa belongs to the Hadoop family and reuses many Hadoop components (data is stored in HDFS and processed with MapReduce); it provides many modules to support log analysis for Hadoop clusters. Its structure is as follows:
There are three main roles in Chukwa: the adaptor, the agent, and the collector.
Agent
- The agent is the program responsible for collecting data on each node.
- An agent is composed of several adaptors.
- Adaptors run inside the agent process and perform the actual data collection.
- The agent is responsible for managing its adaptors: it provides various services to them, including starting and stopping adaptors, passing their data to the collector via HTTP, and periodically recording adaptor state for crash recovery.
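To illustrate how an adaptor is registered, here is a small Python sketch that sends a command to the agent's plain-text control port. The port number (9093), the adaptor class name, and the argument order are assumptions based on Chukwa's documentation and may vary between versions; this is not an official Chukwa API.

```python
import socket

def add_file_tailing_adaptor(agent_host="localhost", control_port=9093):
    # The Chukwa agent accepts line-oriented commands such as "add",
    # "list" and "close" on its control port. The command below asks it
    # to tail /var/log/app.log with the data type "MyDataType",
    # starting at offset 0 (all values here are illustrative).
    command = "add filetailer.CharFileTailingAdaptorUTF8 MyDataType /var/log/app.log 0\n"
    with socket.create_connection((agent_host, control_port)) as sock:
        sock.sendall(command.encode("utf-8"))
        # The agent replies with the identifier of the new adaptor.
        return sock.recv(1024).decode("utf-8").strip()

if __name__ == "__main__":
    print(add_file_tailing_adaptor())
```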
Collector
- Merges data from multiple data sources and then loads it into HDFS, hiding the details of the HDFS implementation; for example, when the HDFS version is replaced, only the collector needs to be modified.
HDFS Storage System
- Chukwa uses HDFS as the storage system.
- HDFS is designed for storing large files with low-concurrency, high-throughput writes, whereas a log system needs exactly the opposite: high-concurrency, low-rate writes and the storage of a large number of small files.
- Note that small files written directly to HDFS are not visible until the file is closed, and HDFS does not support reopening a file for further writes.
Demux and archiving
- Chukwa directly supports processing the data with MapReduce.
- It contains two built-in MapReduce jobs, which are used to archive the data and to convert it into structured logs, which are then stored in the data store (a database, HDFS, etc.).
LinkedIn's Kafka
Kafka was open-sourced in December 2010. It is written in Scala and uses a variety of efficiency optimizations; its overall architecture is relatively novel (push/pull), making it better suited to heterogeneous clusters. The main design points are as follows:
- Kafka is designed with persistent messages as the common case.
- The primary design constraint is throughput, not features.
- The state about which data has already been consumed is kept by the data consumer, not stored on the server.
- Kafka is an explicitly distributed system: it assumes that producers, brokers, and consumers are spread across multiple machines.
Architecture:
- Kafka is essentially a message publish/subscribe system.
- A producer publishes messages to a topic, and consumers subscribe to that topic; whenever a new message arrives for the topic, the broker delivers it to all consumers that subscribed to it.
- In Kafka, messages are organized by topic, and each topic is divided into multiple partitions, which makes it easy to manage data and balance load.
- At the same time, Kafka uses ZooKeeper for load balancing.
There are three main roles in Kafka: the producer, the broker, and the consumer.
Producer
- The task of the producer is to send data to the broker.
- Kafka provides two producer interfaces: a low-level interface, which sends data to a particular partition of a topic on a specific broker, and a high-level interface, which supports synchronous/asynchronous sending, ZooKeeper-based broker discovery, and load balancing (based on a partitioner). A minimal producer sketch follows this list.
- Among these, ZooKeeper-based broker discovery is worth highlighting: a producer can obtain the list of available brokers through ZooKeeper, or register a listener in ZooKeeper that is woken up in the following situations:
- a) a broker is added
- b) a broker is deleted
- c) a new topic is registered
- d) a broker registers an existing topic
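Below is a minimal producer sketch in Python using the third-party kafka-python client. Note that this is a modern client: the ZooKeeper-based broker discovery described above belongs to early Kafka releases, while current clients bootstrap directly from a broker address. The broker address and topic name are placeholders.

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# send() is asynchronous: it returns a future that resolves to the record
# metadata (topic, partition, offset) once the broker acknowledges the write.
future = producer.send("app_logs", b"GET /index.html 200")
metadata = future.get(timeout=10)
print(metadata.topic, metadata.partition, metadata.offset)

producer.flush()  # block until all buffered messages have been delivered
```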
Broker
The broker adopts several strategies to improve data-handling efficiency, including the sendfile system call and zero-copy transfers.
Consumer
- The role of the consumer is to load the log data into a central storage system.
- Kafka provides two consumer interfaces. One is a low-level interface that maintains a connection to a single broker; the connection is stateless, i.e. each time data is pulled from the broker, the offset into the broker's data must be supplied. The other is a high-level interface that hides the broker details, allowing the consumer to pull data without caring about the underlying network topology. A minimal consumer sketch follows this list.
- More importantly, in most log systems the broker keeps track of which data the consumer has already fetched, whereas in Kafka this information is maintained by the consumer itself.
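Below is a minimal consumer sketch, again with the third-party kafka-python client. Offset auto-commit is disabled here to stay closer to the model described above, where the consumer keeps track of its own position; the broker address, topic, and group id are placeholders.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "app_logs",
    bootstrap_servers="localhost:9092",
    group_id="log-loader",
    enable_auto_commit=False,    # the application decides how to record its offsets
    auto_offset_reset="earliest",
)

for message in consumer:
    # Each record carries its partition and offset, so the consumer can
    # persist its own position and resume from it after a restart.
    print(message.partition, message.offset, message.value)
```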
Cloudera's Flume
Flume is Cloudera's log collection system, open-sourced in July 2009. It has a wide range of built-in components that users can employ with little or no additional development. Its design goals are as follows:
- Reliability: when a node fails, logs can be transferred to other nodes without loss. Flume provides three levels of reliability guarantees, from strong to weak:
- End-to-end (the agent that receives the data first writes the event to disk and deletes it only after the transfer succeeds; if sending fails, the data can be resent)
- Store on failure (this is also the strategy adopted by Scribe: when the data receiver crashes, the data is written locally and sending resumes after the receiver recovers)
- Best effort (no acknowledgement after the data is sent to the receiver)
- Scalability: Flume uses a three-tier architecture consisting of agent, collector, and storage, and each tier can be scaled horizontally. All agents and collectors are managed by a master, which makes the system easy to monitor and maintain; multiple masters are allowed (using ZooKeeper for management and load balancing), which avoids a single point of failure.
- Manageability: all agents and collectors are managed centrally by the master, which makes the system easy to maintain. Users can view individual data sources or data flows on the master, and each data source can be configured and loaded dynamically. Flume provides two forms of data-flow management: a web interface and shell script commands.
- Feature extensibility: users can add their own agents, collectors, or storage as needed. In addition, Flume ships with many components, including various agents (file, syslog, etc.), collectors, and storage (file, HDFS, etc.).
Architecture:
Flume uses a layered architecture consisting of three tiers: agent, collector, and storage. The agent and collector tiers are each composed of two parts, a source and a sink: the source is where the data comes from, and the sink is where the data goes.
(1) Agent: the role of the agent is to send data from a data source to the collector. Flume comes with many directly usable data sources (sources), such as:
- text("filename"): sends the file filename as a data source, line by line
- tail("filename"): detects new data appended to filename and sends it line by line
- fsyslogTcp(5140): listens on TCP port 5140 and forwards the data it receives
- A number of sinks are also provided, such as:
- console[("format")]: displays the data directly on the console
- text("txtfile"): writes the data to the file txtfile
- dfs("dfsfile"): writes the data to the file dfsfile on HDFS
- syslogTcp("host", port): sends the data over TCP to the host node
(2) Collector: the role of the collector is to aggregate the data of multiple agents and load it into storage. Its sources and sinks are similar to the agent's.
(3) Storage: storage is the storage system; it can be an ordinary file, or HDFS, Hive, HBase, and so on.
Related concept: the event
The core of Flume is to collect data from a data source and deliver it to a specified destination (sink). To ensure that delivery succeeds, the data is cached (in a channel) before being sent to the destination; only when the data has actually arrived at the destination does Flume delete its own cached copy.
Throughout the data transfer, what flows is the event; that is, transactions are guaranteed at the event level. So what is an event? An event encapsulates the transmitted data and is the basic unit of data that Flume transfers; if the source is a text file, an event is usually one line (one record), and it is also the basic unit of a transaction. An event travels from source to channel to sink; it is itself a byte array and can carry header information. An event represents the smallest complete unit of data, moving from an external data source to an external destination.
Summary
From the architecture of these four systems we can conclude that such a system needs three basic components: the agent (which encapsulates the data source and sends its data to the collector), the collector (which receives data from multiple agents, aggregates it, and imports it into the back-end store), and the store (the central storage system, which should be scalable and reliable and should support the currently very popular HDFS).
The four systems (Scribe, Chukwa, Kafka, Flume) are compared and analyzed as follows:
Resources:
- https://www.ibm.com/developerworks/cn/opensource/os-cn-chukwa/
- http://www.open-open.com/lib/view/open1386814790486.html