Distributed message system: Kafka


Kafka is a distributed publish-subscribe messaging system. It was initially developed at LinkedIn and later became part of the Apache project. Kafka is a distributed, partitioned, replicated, and persistent log service, mainly used to process active streaming data.

In big data systems, we often encounter a problem: the overall system is composed of various subsystems, and data must flow continuously among them with high performance and low latency. Traditional enterprise messaging systems are not well suited to large-scale data processing. Kafka emerged in order to handle online applications (messages) and offline applications (data files, logs) at the same time. Kafka can play two roles: it lowers the networking complexity of the system, and it lowers programming complexity, because subsystems no longer negotiate interfaces with one another; each subsystem simply plugs into Kafka, which acts as a high-speed data bus.

Kafka architecture:


The overall architecture of Kafka is very simple: it is an explicitly distributed architecture in which there can be multiple producers, brokers (Kafka servers), and consumers. Producers and consumers implement Kafka's registration interface. Data is sent from producers to brokers, and a broker acts as an intermediary that caches the data and distributes it to the consumers registered in the system. The broker's role is similar to a cache sitting between active data and offline processing systems. Communication between clients and servers uses a simple, high-performance TCP protocol that is independent of any programming language. Several basic concepts:

- Topic: a category of messages handled by Kafka; producers publish messages to topics and consumers subscribe to them.
- Partition: a physical grouping of a topic; each topic is divided into one or more ordered partitions.
- Message: the basic unit of data transferred through the system.
- Producer: a client that publishes messages to a topic.
- Consumer: a client that subscribes to topics and consumes the published messages.
- Broker: a Kafka server that caches and stores messages; a Kafka cluster consists of one or more brokers.

Message sending process:

1. The producer publishes a message to a partition of the specified topic according to a partitioning rule.
2. The broker receives the message and persists it to the corresponding partition on disk.
3. The consumer pulls data from the broker and controls the offset at which it consumes.
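As a concrete illustration of this flow, here is a minimal producer sketch using the modern Kafka Java client; the broker address, topic name, and message contents are illustrative assumptions, not part of the original article:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class SendExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The record key drives partition selection (same key -> same
                // partition), corresponding to the partitioning rule in step 1.
                producer.send(new ProducerRecord<>("page-views", "user-42", "clicked /home"));
            }
        }
    }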

Kafka design:

1. Throughput

High throughput is one of Kafka's core design goals. To achieve it, Kafka makes the following design choices:

- Messages are persisted to disk as sequential appends to the log, relying on the operating system's page cache rather than an in-process cache.
- Zero-copy (sendfile) is used to move data from the page cache to the network.
- Messages are sent and fetched in batches.
- Message batches can be compressed.
- A topic is divided into multiple partitions, allowing producers, brokers, and consumers to work in parallel.
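The batching and compression points are exposed as ordinary producer settings; a minimal sketch, with illustrative values that are assumptions rather than recommendations from the original article:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import java.util.Properties;

    public class ThroughputConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // Settings that implement the batching and compression design points:
            props.put("batch.size", "65536");      // accumulate up to 64 KB per partition batch
            props.put("linger.ms", "10");          // wait up to 10 ms so batches can fill
            props.put("compression.type", "gzip"); // compress each batch before sending
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // ... send records as in the earlier example ...
            }
        }
    }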

Load balancing

- The producer sends each message to a partition chosen by a user-specified algorithm.
- A topic has multiple partitions, and each partition has its own replicas, spread across different broker nodes.
- One replica of each partition is elected leader to handle reads and writes, with ZooKeeper responsible for failover.

Pull System

Because the Kafka broker persists data to disk, it is under no memory pressure, so consumers are well suited to consuming data in pull mode (see the sketch below), which brings the following benefits:

- It simplifies the design of the broker.
- The consumer controls the rate at which it pulls messages according to its own consumption capacity.
- The consumer can choose its own consumption mode, for example batch consumption, re-consuming old messages, or starting from the latest offset.
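A minimal pull-mode consumer sketch with the Kafka Java client; the group id, topic name, and broker address are illustrative assumptions:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class PullExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "demo-group");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-views"));
                while (true) {
                    // poll() is the pull: the consumer asks the broker for data
                    // at its own pace instead of having data pushed to it.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                }
            }
        }
    }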

Scalability

When a broker node needs to be added, the new broker registers itself in ZooKeeper, and producers and consumers perceive the change through the watchers they have registered on ZooKeeper and adjust in time.
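A rough sketch of that watcher mechanism using the plain ZooKeeper Java client; /brokers/ids is the path under which Kafka brokers register themselves, while the connection string and session timeout here are illustrative assumptions:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import java.util.List;

    public class BrokerWatcher implements Watcher {
        private ZooKeeper zk;

        public static void main(String[] args) throws Exception {
            BrokerWatcher watcher = new BrokerWatcher();
            watcher.zk = new ZooKeeper("localhost:2181", 3000, watcher); // assumed ZooKeeper address
            Thread.sleep(Long.MAX_VALUE); // keep the process alive to receive events
        }

        private void watchBrokers() throws Exception {
            // ZooKeeper watches fire only once, so the watch is re-registered
            // each time the broker list is read.
            List<String> brokerIds = zk.getChildren("/brokers/ids", this);
            System.out.println("Live brokers: " + brokerIds);
        }

        @Override
        public void process(WatchedEvent event) {
            try {
                watchBrokers(); // connection established, or a broker joined/left
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }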

Application scenarios of Kafka:

1. Message Queue

Compared with most messaging systems, Kafka offers better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good fit for large-scale message processing applications. Messaging applications usually need only modest throughput, but they demand low end-to-end latency and depend on the strong durability guarantees Kafka provides. In this domain, Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

2. Behavior tracking

Another application scenario of Kafka is tracking user behavior, such as page views and searches, by recording each action in real time to a corresponding topic in publish-subscribe fashion. Once subscribers obtain these events, they can process them in real time, monitor them in real time, or load them into Hadoop or an offline data warehouse for offline processing.

3. Metadata monitoring

Kafka is used as a monitoring module for operational records, that is, to collect and record operational information; this can be understood as operations and maintenance (O&M) data monitoring.

4. Log collection

Many open-source products exist for log collection, including Scribe and Apache Flume, and many users adopt Kafka in their place for log aggregation. Log aggregation typically collects log files from servers and stores them in a central location (a file server or HDFS) for processing. Kafka, however, abstracts away the details of files and treats logs or events as a message stream. This gives Kafka lower processing latency and makes it easier to support multiple data sources and distributed consumption. Compared with log-centric systems such as Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees thanks to replication, and lower end-to-end latency.

5. Stream Processing

This scenario is common and easy to understand: collected stream data is saved in Kafka and then handed to Storm or another stream-computing framework for processing. Many users process, aggregate, enrich, or otherwise transform data from an original topic into a new topic before continuing with subsequent processing. For example, an article-recommendation pipeline might first crawl article content from RSS feeds and drop it into a topic called "articles"; subsequent steps might clean that content, for instance normalizing the data or removing duplicates, before finally returning matched results to the user. Beyond a single independent topic, this produces a series of real-time data-processing stages. Storm and Samza are well-known frameworks that implement this kind of transformation.
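A bare-bones consume-transform-produce loop of the kind described here, written against the Kafka Java client; the topic names and the cleaning/dedup logic are illustrative assumptions, and a real pipeline would use Storm, Samza, or a similar framework:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Properties;
    import java.util.Set;

    public class ArticleCleaner {
        public static void main(String[] args) {
            Properties cProps = new Properties();
            cProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            cProps.put("group.id", "article-cleaner");
            cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Properties pProps = new Properties();
            pProps.put("bootstrap.servers", "localhost:9092");
            pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            Set<String> seen = new HashSet<>(); // naive in-memory dedup, for illustration only

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
                consumer.subscribe(Collections.singletonList("articles"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        String cleaned = record.value().trim(); // stand-in for real cleaning
                        if (seen.add(cleaned)) {                // drop duplicates
                            producer.send(new ProducerRecord<>("articles-clean", record.key(), cleaned));
                        }
                    }
                }
            }
        }
    }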

6. Event Source

Event sourcing is an application design style in which state changes are recorded as a time-ordered sequence of records. Kafka's ability to store very large volumes of log data makes it an excellent backend for applications built this way, for example a news feed.

7. Persistent log (commit log)

Kafka can serve as an external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-synchronization mechanism so that failed nodes can restore their data. Kafka's log compaction feature supports this usage. In this usage, Kafka is similar to the Apache BookKeeper project.
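Log compaction is enabled per topic through the cleanup.policy setting; a minimal sketch using the Kafka admin client, where the topic name, partition count, and replication factor are illustrative assumptions:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    public class CompactedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // cleanup.policy=compact keeps only the latest record per key,
                // which is what makes the commit-log usage described above work.
                NewTopic topic = new NewTopic("node-state", 1, (short) 1)
                        .configs(Map.of("cleanup.policy", "compact"));
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }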

Key points of Kafka design:

1. Directly use the Linux file-system page cache to cache data efficiently.

2. Use Linux zero-copy (the sendfile system call) to improve send performance. Traditional data transmission requires four context switches; with sendfile, the data is handed off entirely in kernel space, and the number of context switches drops to two. According to test results, this can improve data-sending performance by 60%. For the technical details of zero-copy, see: https://www.ibm.com/developerworks/linux/library/j-zerocopy/
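In the JVM, sendfile is reached through FileChannel.transferTo, which is the call Kafka uses to serve log segments; a minimal sketch, where the file name, host, and port are illustrative assumptions:

    import java.io.FileInputStream;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    public class ZeroCopySend {
        public static void main(String[] args) throws Exception {
            try (FileChannel file = new FileInputStream("segment.log").getChannel(); // assumed file
                 SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
                long position = 0;
                long size = file.size();
                while (position < size) {
                    // transferTo maps to sendfile(2) on Linux: bytes go from the
                    // page cache straight to the socket without entering user space.
                    position += file.transferTo(position, size - position, socket);
                }
            }
        }
    }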

3. The cost of accessing data on disk is O(1). Kafka manages messages by topic. Each topic contains multiple partitions; each partition corresponds to a logical log, which is physically made up of multiple segment files, each storing many messages. A message's id is determined by its logical position, so the id maps directly to the message's storage location, avoiding an extra id-to-location mapping. Each partition keeps an in-memory index recording the offset of the first message in each segment. Messages published to a topic are spread evenly over its partitions (either randomly or through a user-specified routing callback). When a broker receives a published message, it appends it to the last segment of the corresponding partition. When the number of messages in a segment reaches a configured value, or the time since a message was published exceeds a threshold, the segment is flushed to disk; only then can subscribers consume those messages. Once a segment reaches a certain size, no more data is written to it, and the broker creates a new segment.
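A toy sketch of the per-partition index just described; the class and method names are made up for illustration and are not Kafka's actual code:

    import java.util.TreeMap;

    // Maps each segment's first message offset to its segment file.
    public class PartitionIndex {
        private final TreeMap<Long, String> segmentByBaseOffset = new TreeMap<>();

        public void addSegment(long baseOffset, String segmentFile) {
            segmentByBaseOffset.put(baseOffset, segmentFile);
        }

        // Because the message id encodes its logical position, finding the
        // right segment is a floor lookup on the base offsets; no per-message
        // id-to-location table is needed.
        public String locateSegment(long messageOffset) {
            return segmentByBaseOffset.floorEntry(messageOffset).getValue();
        }

        public static void main(String[] args) {
            PartitionIndex index = new PartitionIndex();
            index.addSegment(0L, "00000000000000000000.log");
            index.addSegment(368769L, "00000000000000368769.log");
            System.out.println(index.locateSegment(500000L)); // -> 00000000000000368769.log
        }
    }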

4. Explicit distribution: all producers, brokers, and consumers are distributed, and there can be many of each. There is no load-balancing mechanism between producers and brokers; load balancing between brokers and consumers is handled through ZooKeeper. All brokers and consumers register themselves in ZooKeeper, which stores some of their metadata. When a broker or consumer changes, all other brokers and consumers are notified.
