Distributed Messaging System: Kafka
Kafka is a distributed publish-subscribe messaging system, originally developed at LinkedIn and later made part of the Apache project. It is a distributed, partitioned, replicated persistent log service, used primarily for processing active streaming data.
Big data systems commonly face one problem: the platform as a whole is composed of many subsystems, and data must flow between those subsystems continuously, with high performance and low latency. Traditional enterprise messaging systems are not ideal for large-scale data processing. Kafka appeared in order to serve both online applications (messages) and offline applications (data files, logs) at once. Kafka plays two roles:
- Reduce the complexity of system networking.
- Reduce programming complexity: instead of each pair of subsystems negotiating its own interface, every subsystem plugs into Kafka the way a plug fits a socket, with Kafka acting as a high-speed data bus.
Kafka's main features:
- High throughput for both publishing and subscribing. Kafka is reported to produce about 250,000 messages (50 MB) per second and to process about 550,000 messages (110 MB) per second.
- Supports persistence. Messages are persisted to disk, so Kafka can serve batch consumers such as ETL jobs as well as real-time applications. Persisting data to disk and replicating it prevents data loss.
- A distributed system that is easy to scale out. There can be many producers, brokers, and consumers, all distributed, and machines can be added without downtime.
- The state of message processing is maintained on the consumer side, not on the server side, and consumers can automatically rebalance on failure.
- Support for online and offline scenarios.
Architecture of Kafka:
Kafka's overall architecture is very simple. It is an explicitly distributed architecture in which there can be many producers, brokers (Kafka servers), and consumers. Producers and consumers implement the interfaces registered with Kafka; data travels from producer to broker, and the broker plays the role of intermediate cache and distributor, passing data on to the consumers registered with the system. The broker thus acts as a cache between active data and offline processing systems. Communication between clients and servers is based on a simple, high-performance TCP protocol that is independent of programming language. Several basic concepts, followed by a short producer sketch:
- Topic: a category of the message feeds that Kafka handles (feeds of messages).
- Partition: a physical grouping of a topic. A topic can be divided into multiple partitions, each of which is an ordered queue. Every message in a partition is assigned an ordered ID, its offset.
- Message: the basic unit of communication. Each producer can publish messages to a topic.
- Producers: producers of messages and data. A process that publishes messages to a Kafka topic is called a producer.
- Consumers: consumers of messages and data. A process that subscribes to topics and processes the messages published to them is called a consumer.
- Broker: cache proxy. The one or more servers in a Kafka cluster are collectively called brokers.
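To make these concepts concrete, here is a minimal producer sketch written against the current Java client (`org.apache.kafka:kafka-clients`), which postdates the API the original article described. The broker address and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to a topic; the key, when present,
            // influences which partition the message is assigned to.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/index.html"));
        }
    }
}
```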
The process of sending messages:
- The producer publishes messages to partitions of the specified topic according to the chosen partitioning method (round-robin, hash, etc.).
- The Kafka cluster receives the message from the producer, persists it to disk, and retains it for a configurable length of time, without caring whether it has been consumed.
- Consumers pull data from the Kafka cluster and control the offset at which messages are fetched (see the consumer sketch below).
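A matching consumer sketch under the same assumptions (placeholder broker, topic, and group names). Note that it is the consumer that tracks its position; the broker simply retains the log for the configured time:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "page-view-processors");    // hypothetical group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // Pull model: the consumer asks for data at its own pace.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```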
Kafka's design:
1. Throughput
High throughput is one of Kafka's core goals, and to achieve it Kafka makes the following design choices (a producer-configuration sketch follows the list):
- Disk persistence: messages are not cached in memory but written directly to disk, taking full advantage of the disk's sequential read and write performance.
- Zero-copy: reduces the number of I/O steps.
- Data is sent in batches
- Data compression
- A topic is divided into multiple partitions to improve parallelism
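As a sketch of how the batching and compression points surface in practice, the Java producer exposes them as configuration. The values below are illustrative examples, not tuned recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

public class TunedProducerConfig {
    // Returns a producer configured to exercise Kafka's throughput features:
    // bulk (batched) sends and compression. Values are illustrative only.
    static Producer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", 65536);        // accumulate up to 64 KB per partition batch
        props.put("linger.ms", 20);            // wait up to 20 ms to fill a batch
        props.put("compression.type", "gzip"); // compress whole batches on the wire
        return new KafkaProducer<>(props);
    }
}
```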
2. Load Balancing
- The producer sends messages to specified partitions based on a user-specified algorithm (a partitioner sketch follows this list)
- There are multiple partitions, each with its own replicas, and the replicas are distributed across different broker nodes
- Among a partition's replicas, a lead partition must be elected; the lead partition handles all reads and writes, and ZooKeeper takes care of failover
- Brokers and consumers joining and leaving dynamically is managed through ZooKeeper
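A sketch of a user-specified partitioning algorithm, written against the Java client's `Partitioner` interface; the class name and hashing choice are hypothetical. It would be enabled with the producer setting `partitioner.class`:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Route messages to a partition by hashing the key, falling back to
// partition 0 for keyless messages. A hypothetical example, not Kafka's
// built-in partitioner.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) return 0;
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}
```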
3. Pull System
Because Kafka brokers persist data, a broker is under no memory pressure, and consumers are well suited to pulling their data, which brings the following benefits (see the sketch after this list):
- Simplified Kafka Design
- Consumers control the pull rate according to their own consumption capacity
- Consumers choose a consumption pattern to suit their circumstances, such as batch consumption, repeated consumption, or consuming from the tail of the log
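A sketch of this offset control with the Java consumer; the topic name, partition, and offsets are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Because consumption state lives on the consumer, it can choose where to
// read from: replay from the start, skip to the end, or any explicit offset.
public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "replay-demo");             // hypothetical group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp)); // repeat consumption
            // consumer.seekToEnd(Collections.singletonList(tp));    // start from the tail
            // consumer.seek(tp, 12345L);                            // or any explicit offset
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}
```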
4. Scalability
When broker nodes need to be added, the new broker registers itself with ZooKeeper; producers and consumers perceive the change through watchers registered on ZooKeeper and adjust in a timely manner, as in the sketch below.
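A sketch of that watcher mechanism using the ZooKeeper Java client. `/brokers/ids` is the znode under which Kafka brokers register; the rest of the class is a hypothetical illustration, not Kafka client code:

```java
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Watches the znode under which brokers register and is notified
// whenever the set of live brokers changes.
public class BrokerWatcher implements Watcher {
    private final ZooKeeper zk;

    BrokerWatcher(String connectString) throws Exception {
        this.zk = new ZooKeeper(connectString, 30_000, this);
        watchBrokers();
    }

    private void watchBrokers() throws Exception {
        // getChildren re-registers this watcher; it fires once per change.
        List<String> ids = zk.getChildren("/brokers/ids", this);
        System.out.println("live brokers: " + ids);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try { watchBrokers(); } catch (Exception ignored) {}
        }
    }
}
```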
Kafka Application Scenarios:
1. Message Queuing
Kafka has better throughput, built-in partitioning, replication, and fault tolerance than most messaging systems, which makes it a good solution for large-scale message-processing applications. Messaging applications generally need relatively low throughput, but they require lower end-to-end latency and often depend on the strong durability guarantees that Kafka provides. In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.
2. Behavioral Tracking
Another scenario for Kafka is tracking user behavior such as page views and searches, recording it in real time to corresponding topics in publish-subscribe fashion. Subscribers then receive these feeds and can process them in real time, monitor them, or load them into Hadoop or an offline data warehouse for processing.
3. Meta-information Monitoring
Used as a monitoring module for operational records, collecting operational data; this can be understood as operations-and-maintenance monitoring data.
4. Log Collection
There are many open-source log collection products, including Scribe and Apache Flume, but many people use Kafka for log aggregation instead. Log aggregation typically collects log files from servers and places them in a centralized location (a file server or HDFS) for processing. Kafka, however, abstracts away the details of files and models logs or events more cleanly as a message stream. This gives Kafka lower processing latency and makes it easier to support multiple data sources and distributed processing. Compared to log-centric systems such as Scribe or Flume, Kafka offers equally efficient performance, stronger durability guarantees thanks to replication, and lower end-to-end latency.
5. Stream Processing
This scenario may be the most common and the easiest to understand: collected stream data is saved and handed off to Storm or another stream-computing framework for processing. Many users take data from an original topic, aggregate it, enrich it, or otherwise transform it into a new topic for further processing. For example, an article recommendation pipeline might crawl article content from RSS feeds and throw it into a topic named "article"; subsequent processing might clean that content, for instance normalizing the data or removing duplicates, before content-matching results are returned to users. Beyond the original topic, each step produces a new topic in a chain of real-time data processing. Storm and Samza are well-known frameworks that implement this kind of data transformation.
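A bare-bones consume-transform-produce stage in the spirit of that pipeline, with placeholder topic names and a stand-in cleanup step (a real deployment would more likely use Storm or Samza, as noted):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Reads raw items from one topic, cleans them, and publishes the result
// to a downstream topic, forming one stage of a processing chain.
public class CleanupStage {
    public static void main(String[] args) {
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        c.put("group.id", "article-cleaner");         // hypothetical group
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> in = new KafkaConsumer<>(c);
             KafkaProducer<String, String> out = new KafkaProducer<>(p)) {
            in.subscribe(Collections.singletonList("article"));
            while (true) {
                for (ConsumerRecord<String, String> r : in.poll(Duration.ofMillis(500))) {
                    String cleaned = r.value().trim(); // stand-in for real cleanup logic
                    out.send(new ProducerRecord<>("article-cleaned", r.key(), cleaned));
                }
            }
        }
    }
}
```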
6. Event Sourcing
Event sourcing is an application design style in which state changes are logged as a time-ordered sequence of records. Kafka's ability to store very large volumes of log data makes it an excellent backend for applications built this way, such as dynamic aggregation (news feeds).
7. Persistent log (commit log)
Kafka can serve as an external persistent log for a distributed system. Such a log helps replicate data between nodes and provides a resynchronization mechanism for restoring data on failed nodes. The log compaction feature in Kafka supports this usage, in which Kafka resembles the Apache BookKeeper project.
Key points of Kafka's design:
1. Use the Linux file system cache directly to cache data efficiently.
2. Use Linux zero-copy to improve transmission performance. Traditional data transmission requires 4 context switches; with the sendfile system call, data is exchanged directly in kernel space, cutting the context switches to 2. According to test results, this can improve data transmission performance by 60%. Detailed technical background on zero-copy: https://www.ibm.com/developerworks/linux/library/j-zerocopy/
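On the JVM this zero-copy path is reachable through `FileChannel.transferTo`, which maps to `sendfile` on Linux. The following generic sketch (not Kafka's internal code) shows the call:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// transferTo moves file bytes from the page cache to the socket inside
// the kernel, without copying them through user space.
public class ZeroCopySend {
    static void sendFile(String path, SocketChannel socket) throws IOException {
        try (FileChannel file = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            long position = 0, remaining = file.size();
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```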
3. The cost of accessing data on disk is O(1). Kafka manages messages by topic. Each topic contains multiple partitions, and each partition corresponds to a logical log made up of multiple segments. Each segment stores multiple messages, and a message's ID is determined by its logical position: the storage location can be computed directly from the message ID, avoiding any additional ID-to-location mapping. Each partition keeps an in-memory index that records the offset of the first message in every segment. Messages that a publisher sends to a topic are distributed evenly across the partitions (either at random or according to a user-specified callback function), and the broker appends each received message to the last segment of the corresponding partition. When the number of messages in a segment reaches a configured value, or the time since a message was published exceeds a threshold, the segment is flushed to disk; only messages flushed to disk can be seen by subscribers. Once a segment reaches a certain size, the broker stops writing data to it and creates a new segment.
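A toy model of that segment lookup, with hypothetical names rather than Kafka source code: each segment is keyed by the offset of its first message, so finding the segment that holds a given message ID is a floor lookup, with no separate ID-to-location table:

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SegmentLookup {
    // base offset of each segment -> segment file name
    private final NavigableMap<Long, String> segments = new TreeMap<>();

    void addSegment(long baseOffset, String fileName) {
        segments.put(baseOffset, fileName);
    }

    // The message ID (offset) alone determines which segment holds the
    // message; the in-memory index then gives the position within it.
    String segmentFor(long offset) {
        Map.Entry<Long, String> entry = segments.floorEntry(offset);
        return entry == null ? null : entry.getValue();
    }
}
```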
4. Explicit distribution: there can be many producers, brokers, and consumers, all distributed. There is no load-balancing mechanism between producers and brokers; ZooKeeper is used for load balancing between brokers and consumers. All brokers and consumers register themselves in ZooKeeper, which keeps some of their metadata; when any broker or consumer changes, all the other brokers and consumers are notified.
Resources:
- Apache Kafka website
- Project design Discussion
- GitHub Mirror
- Morten Kjetland's introduction to Apache Kafka
- Comparison with RabbitMQ on Quora
- Kafka: A Distributed Messaging System for Log Processing
- Zero-copy principle
- Kafka and Hadoop