Introduction to the distributed messaging system Kafka

Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is a distributed, partitioned, persistent log service with redundant replicas, mainly used to process active streaming data.

In big data systems we often encounter a problem: such systems are composed of many subsystems, and data must flow between them continuously with high performance and low latency. Traditional enterprise messaging systems are not well suited to large-scale data processing. Kafka was built to serve online applications (messages) and offline applications (data files, logs) at the same time. Kafka can play two roles:

  1. Reduces the complexity of system networking.
  2. Reduces programming complexity: subsystems no longer negotiate with each other pairwise; each subsystem simply plugs into a socket, with Kafka acting as a high-speed data bus.
1. Main Features of Kafka:
  1. Provides high throughput for both publishing and subscribing. Kafka can produce about 250,000 messages per second (50 MB/s) and process about 550,000 messages per second (110 MB/s).
  2. Supports persistence. Messages are persisted to disk, so they can be used for batch consumption (such as ETL) as well as by real-time applications. Persisting data to disk and replicating it prevents data loss.
  3. A distributed system that is easy to scale out. Producers, brokers, and consumers are all distributed, and machines can be added without downtime.
  4. The state of consumed messages is maintained on the consumer side rather than the server side, with automatic rebalancing upon failure.
  5. Supports online and offline scenarios.
2. Kafka architecture:

The overall architecture of Kafka is very simple; it is an explicitly distributed architecture with multiple producers, brokers (Kafka servers), and consumers. Producers and consumers implement Kafka's registration interface. Data is sent from producers to brokers, and the broker acts as an intermediary that caches and distributes messages; in effect, it is a cache between active data and the offline processing systems. Communication between clients and servers is based on a simple, high-performance TCP protocol that is independent of any programming language.

3. Several Basic concepts:
  1. Topic: a category of messages; Kafka distinguishes the message streams it handles by topic.
  2. Partition: a physical grouping of a topic. A topic can be divided into multiple partitions, each of which is an ordered queue. Each message within a partition is assigned a sequential ID called its offset.
  3. Message: the basic unit of communication. Each producer publishes messages to a topic.
  4. Producers: processes that publish messages to a Kafka topic are called producers.
  5. Consumers: processes that subscribe to topics and process the published messages are called consumers.
  6. Broker: cache proxy. One or more servers in a Kafka cluster are collectively called brokers.
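These concepts can be sketched as a toy in-memory model (an illustration only, not Kafka's actual storage format): a topic is split into partitions, and each partition is an ordered queue that assigns sequential offsets.

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    """An ordered, append-only queue; each message gets a sequential offset."""
    messages: list = field(default_factory=list)

    def append(self, message: bytes) -> int:
        self.messages.append(message)
        return len(self.messages) - 1  # the new message's offset

@dataclass
class Topic:
    """A named message category, physically split into partitions."""
    name: str
    partitions: list = field(
        default_factory=lambda: [Partition() for _ in range(3)])

topic = Topic("page-views")
offset = topic.partitions[0].append(b"user-1 viewed /home")
print(offset)  # → 0
```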
4. Message sending process:

  1. The producer publishes messages to a partition of the specified topic, chosen by the configured partitioning method (round-robin, hash, etc.).
  2. After the Kafka cluster receives a message from a producer, it persists the message to disk and retains it for a configurable length of time, regardless of whether it has been consumed.
  3. Consumers pull data from the Kafka cluster and control the offset at which they read.
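The partition selection in step 1 can be sketched as follows. This is a simplified illustration: the hash function and the partition count are assumptions, not Kafka's actual defaults.

```python
import itertools
import zlib

NUM_PARTITIONS = 4  # assumed partition count for the topic

def hash_partition(key: bytes) -> int:
    """Keyed messages: the same key always lands in the same partition,
    preserving per-key ordering."""
    return zlib.crc32(key) % NUM_PARTITIONS

_counter = itertools.count()

def round_robin_partition() -> int:
    """Keyless messages: spread evenly across all partitions."""
    return next(_counter) % NUM_PARTITIONS

assert hash_partition(b"user-42") == hash_partition(b"user-42")
print([round_robin_partition() for _ in range(6)])  # → [0, 1, 2, 3, 0, 1]
```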
5. Kafka design:

5.1 Throughput

High throughput is one of the core objectives of Kafka. Therefore, Kafka has made the following designs:

  1. Disk persistence: messages are not cached in memory but written directly to disk, taking full advantage of the disk's sequential read/write performance.
  2. Zero-copy: reduces the number of I/O steps.
  3. Batch data sending
  4. Data Compression
  5. Topics are divided into multiple partitions to improve parallelism.
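Points 3 and 4 work together: compressing a whole batch of similar messages achieves much better ratios than compressing each message alone. A minimal sketch, using gzip and a made-up length-prefixed framing (not Kafka's wire format):

```python
import gzip

def make_batch(messages: list[bytes]) -> bytes:
    """Frame each message with a 4-byte length prefix, then gzip the
    whole batch: many similar messages compress far better together."""
    framed = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return gzip.compress(framed)

def read_batch(batch: bytes) -> list[bytes]:
    """Invert make_batch: decompress, then walk the length prefixes."""
    framed, out, i = gzip.decompress(batch), [], 0
    while i < len(framed):
        n = int.from_bytes(framed[i:i + 4], "big")
        out.append(framed[i + 4:i + 4 + n])
        i += 4 + n
    return out

msgs = [b'{"event": "click", "page": "/home"}'] * 100
batch = make_batch(msgs)
assert read_batch(batch) == msgs
print(len(batch) < sum(len(m) for m in msgs))  # → True
```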
5.2 Load balancing
  1. Producers send messages to specified partitions based on a user-specified algorithm.
  2. There are multiple partitions, each with its own replicas, and the replicas are distributed across different broker nodes.
  3. Among a partition's replicas, a leader is elected; the leader partition handles reads and writes, and ZooKeeper handles failover.
  4. ZooKeeper manages brokers and consumers joining and leaving dynamically.
5.3 Pull System

Since Kafka brokers persist data and are under no memory pressure, consumers are well suited to pulling data, which has the following benefits:

  1. Simplifies the Kafka design.
  2. Consumers control the rate at which they pull messages based on their own consumption capacity.
  3. Consumers choose their own consumption mode, such as batch consumption, repeated consumption, or consuming from the end of the log.
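The pull loop and consumer-controlled offset can be sketched like this (a toy model; `poll` and the in-memory log are illustrative stand-ins for the real fetch API):

```python
def poll(log, offset, max_records=10):
    """Pull up to max_records starting at offset; empty list means caught up."""
    return log[offset:offset + max_records]

log = [f"msg-{i}".encode() for i in range(25)]
offset = 0                      # the consumer, not the broker, tracks this
while True:
    records = poll(log, offset, max_records=10)
    if not records:
        break                   # caught up; a real consumer would wait here
    # batch consumption: process the whole chunk at once
    offset += len(records)      # advancing (or rewinding) is our decision
print(offset)  # → 25
```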
5.4 Scalability

When a broker node needs to be added, the new broker registers with ZooKeeper; producers and consumers perceive the change through the watchers they have registered on ZooKeeper and adjust in time.

5.5 Message deletion policy

One difference between Kafka and JMS implementations (such as ActiveMQ) is that a message is not deleted immediately after it is consumed. Log files are deleted after a retention period configured on the broker; for example, if log files are retained for two days, a file is cleared two days after it was written, whether or not its messages were consumed. Kafka uses this simple approach to release disk space. Moreover, Kafka's performance does not degrade as the number of log files grows, so retaining a large number of them causes no problem.
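The time-based retention described above can be sketched as a periodic cleanup over segment files (segment names and timestamps here are illustrative):

```python
import time

RETENTION_SECONDS = 2 * 24 * 3600  # "retain log files for 2 days"

def expired_segments(segments, now=None):
    """Return names of segments whose newest message is older than the
    retention window; they are deleted whether or not they were consumed."""
    now = time.time() if now is None else now
    return [name for name, last_ts in segments.items()
            if now - last_ts > RETENTION_SECONDS]

# Segment name -> timestamp of its newest message (seconds).
segments = {"00000000.log": 0, "00010000.log": 3 * 24 * 3600}
print(expired_segments(segments, now=4 * 24 * 3600))  # → ['00000000.log']
```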

In Kafka, the consumer is responsible for maintaining its own consumption records; the broker does not track them. This design increases the flexibility of the consumer side and moderately reduces the complexity of the broker side, and it differs from many JMS providers. The design of message acknowledgment in Kafka is also quite different from JMS: messages are delivered to consumers in batches (usually bounded by message count or chunk size), and after a batch is successfully consumed, the consumer commits the offset of those messages to ZooKeeper instead of sending an ack. As you may have realized, this "loose" design carries a risk of losing or redelivering messages.
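The commit-after-consume protocol can be sketched as follows (a toy model; the `committed` dict stands in for the offset stored in ZooKeeper). It also shows why the design is "loose": a crash after processing but before the commit leads to redelivery.

```python
committed = {"topic-0": 0}  # stand-in for the offset stored in ZooKeeper

def consume_batch(log, partition="topic-0", batch_size=5):
    """Pull from the last committed offset, process, then commit.
    A crash between processing and committing means the same batch is
    pulled again on restart: at-least-once delivery, possible duplicates."""
    start = committed[partition]
    batch = log[start:start + batch_size]
    for record in batch:
        pass  # process(record) would go here
    committed[partition] = start + len(batch)  # commit only after success
    return batch

log = list(range(12))
consume_batch(log)
consume_batch(log)
consume_batch(log)
print(committed["topic-0"])  # → 12
```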

6. Application scenarios of Kafka:

6.1 Message Queue

Compared with most message systems, Kafka provides better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message processing. Typically, a traditional message system has relatively low throughput but demands low end-to-end latency and the strong durability guarantees that Kafka provides. In this domain, Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

6.2 Behavior tracking

Another application scenario of Kafka is tracking user behavior such as page views and searches, recording it in real time to the corresponding topics in publish-subscribe mode. Subscribers can then process these results further in real time, monitor them in real time, or load them into Hadoop or an offline data warehouse.

6.3 Metadata monitoring

Kafka can serve as a monitoring module for operational records, that is, collecting and recording operational information, which can be understood as O&M data monitoring.

6.4 Log collection

There are many open-source products for log collection, including Scribe and Apache Flume, but many users use Kafka for log aggregation instead. Log aggregation typically collects log files from servers and stores them in a centralized location (a file server or HDFS) for processing. Kafka, however, abstracts away the file details and treats logs or events as a message stream. This gives Kafka lower processing latency and makes it easier to support multiple data sources and distributed consumption. Compared with log-centric systems such as Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees thanks to replication, and lower end-to-end latency.

6.5 Stream processing

This scenario is common and easy to understand: save collected stream data for Storm or another stream-computing framework to process. Many users process, aggregate, enrich, or otherwise transform data from an original topic into a new topic before continuing with subsequent processing. For example, an article-recommendation pipeline might crawl article content from RSS feeds and drop it into a topic called "articles"; subsequent steps might clean the content, such as normalizing data or removing duplicates, and finally return matching results to the user. Beyond the independent topics, this produces a series of real-time data-processing stages. Storm and Samza are well-known frameworks for this kind of data transformation.

6.6 Event sourcing

Event sourcing is an application design style in which state changes are recorded as a time-ordered sequence of records. Kafka's ability to store very large amounts of log data makes it an excellent backend for such applications, for example a news feed.

6.7 Persistent log (commit log)

Kafka can serve as an external persistent log for a distributed system. Such a log replicates data between nodes and acts as a re-synchronization mechanism for failed nodes to restore their data. The log-compaction feature in Kafka supports this usage, in which Kafka is similar to the Apache BookKeeper project.

7. Key points of Kafka design:

7.1 Use the Linux file system cache directly to cache data efficiently.

7.2 Use Linux zero-copy to improve sending performance.

Traditional data transmission requires four context switches; with the sendfile system call, data is copied directly in kernel mode, reducing the context switches to two. According to test results, this can improve data-sending performance by 60%. For technical details of zero-copy, see: https://www.ibm.com/developerworks/linux/library/j-zerocopy/
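Python exposes the same sendfile system call, so the idea can be demonstrated directly. This is a small Unix-only illustration over a socket pair, not Kafka code (Kafka reaches sendfile through Java's FileChannel.transferTo):

```python
import os
import socket
import tempfile

# Write a payload to an ordinary file.
payload = b"x" * 4096
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

# sendfile() moves file pages to the socket inside the kernel:
# no read()/write() round trip through a userspace buffer.
server, client = socket.socketpair()
with open(path, "rb") as src:
    sent = os.sendfile(server.fileno(), src.fileno(), 0, len(payload))
server.close()

received = b"".join(iter(lambda: client.recv(65536), b""))
client.close()
os.unlink(path)
print(sent == len(payload) and received == payload)  # → True
```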

7.3 The cost of data access on disk is O(1).

Kafka manages messages by topic. Each topic comprises multiple partitions, each partition corresponds to a logical log, and each log consists of multiple segment files, each storing many messages. A message's ID is determined by its logical position, so the ID can be mapped directly to the message's storage location, avoiding an extra ID-to-location mapping. Each partition keeps an in-memory index that records the offset of the first message in each segment. Messages published to a topic are distributed evenly across its partitions (randomly, or via a user-specified callback). When a broker receives a published message, it appends it to the last segment of the corresponding partition. A segment is flushed to disk when the number of messages on it reaches a configured value or the messages have been held beyond a time threshold; only messages flushed to disk can be consumed by subscribers. Once a segment reaches a certain size, no more data is written to it and the broker creates a new segment.
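Because offsets are assigned sequentially, mapping a message ID to its storage location needs only the per-segment base offsets, not a per-message table. A sketch (file names and the fixed-size positions are simplifications; real messages are variable-length, so a sparse index is still consulted within a segment):

```python
import bisect

# In-memory index: the base (first) offset of each on-disk segment file.
segment_bases = [0, 1000, 2000, 3000]
segment_files = ["00000000.log", "00001000.log",
                 "00002000.log", "00003000.log"]

def locate(offset):
    """Map a logical message offset to (segment file, offset within it).
    Offsets are sequential, so one binary search over the segment base
    offsets suffices; no per-message ID-to-location table is needed."""
    i = bisect.bisect_right(segment_bases, offset) - 1
    return segment_files[i], offset - segment_bases[i]

print(locate(2500))  # → ('00002000.log', 500)
```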

7.4 Explicitly distributed.

That is, all producers, brokers, and consumers are distributed. There is no load-balancing mechanism between producers and brokers; brokers and consumers are balanced through ZooKeeper. All brokers and consumers register with ZooKeeper, which stores some of their metadata. When a broker or consumer changes, all other brokers and consumers are notified.

