Apache Kafka: The Next-Generation Distributed Messaging System

Introduction

Apache Kafka is a distributed publish-subscribe messaging system. It was initially developed at LinkedIn and later became part of the Apache project. Kafka is a fast, scalable log service that is by design distributed, partitioned, and replicated.

Compared with traditional message systems, Apache Kafka has the following differences:

  • It is designed as a distributed system and is easy to scale out;
  • It also provides high throughput for publishing and subscription;
  • It supports multiple subscribers and automatically balances consumers when a failure occurs;
  • It persists messages to disk and can therefore be used for batched consumption, such as ETL, in addition to real-time applications.

This article focuses on the architecture and features of Apache Kafka and helps us understand why Kafka is better than traditional messaging services.

I will compare the features of Kafka with those of the traditional messaging services RabbitMQ and Apache ActiveMQ, and discuss some scenarios in which Kafka is the better choice. In the last section, we will walk through an example application that demonstrates the use of Kafka as a messaging server. The complete source code of this example application is on GitHub, and a detailed discussion of it appears in the last section of this article.

Architecture

First, let me introduce the basic concepts of Kafka. Its architecture includes the following components:

  • A topic is a stream of messages of a particular type. A message is a payload of bytes, and a topic is the category or feed name to which messages are published.
  • A producer is any object that can publish messages to a topic.
  • Published messages are stored in a set of servers called brokers, which together form a Kafka cluster.
  • Consumers can subscribe to one or more topics and consume the published messages by pulling data from the brokers.

 

Figure 1: Kafka producer, consumer, and broker environment

The producer can choose its preferred serialization method to encode the message content. To improve efficiency, the producer can send a set of messages in a single publish request. The following code demonstrates how to create a producer and send messages.

Sample producer code:

producer = new Producer(…);
message = new Message("test message str".getBytes());
set = new MessageSet(message);
producer.send("topic1", set);

To subscribe to a topic, the consumer first creates one or more message streams for that topic. Messages published to the topic are evenly distributed across these streams. Each message stream provides an iterator interface over the continuously produced messages. The consumer then iterates over every message in the stream and processes the message's payload. Unlike traditional iterators, the message stream iterator never terminates: if no message is currently available, the iterator blocks until a new message is published to the topic. Kafka supports both the point-to-point delivery model, in which multiple consumers jointly consume a single copy of the messages in a queue, and the publish-subscribe model, in which multiple consumers each receive their own copy of every message. The following code demonstrates how a consumer uses messages.

Sample Consumer Code:

streams[] = Consumer.createMessageStreams("topic1", 1)
for (message : streams[0]) {
    bytes = message.payload();
    // do something with the bytes
}

The overall architecture of Kafka is shown in Figure 2. Because Kafka is distributed by nature, a Kafka cluster typically consists of multiple brokers. To balance load, a topic is divided into multiple partitions, and each broker stores one or more of those partitions. Multiple producers and consumers can publish and retrieve messages at the same time.
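
To make partitioning concrete, here is a minimal, hedged sketch (my own illustration; Kafka's real partitioners differ in detail) of how a producer-side partitioner can map a message key to one of a topic's partitions:

// Illustrative only: maps a message key to one of numPartitions partitions.
public class SimplePartitioner {
    public static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is non-negative even for negative hash codes.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("order-42", 4)); // always a value in [0, 4)
    }
}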

Figure 2: Kafka Architecture

Kafka Storage

The storage layout of Kafka is very simple. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of roughly equal size. Each time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. The segment file is flushed to disk after a configurable number of messages has been published or after a certain amount of time has elapsed, and only after the flush are the messages exposed to consumers.
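
The following is a deliberately simplified sketch of this idea, my own illustration rather than Kafka's actual log implementation: messages are appended to the active segment file and flushed only after a threshold number of appends:

import java.io.FileOutputStream;
import java.io.IOException;

// Simplified sketch of appending to a segment file with a flush threshold.
public class SegmentLog {
    private static final int FLUSH_INTERVAL = 1000;   // messages between flushes (illustrative)
    private final FileOutputStream activeSegment;
    private int unflushed = 0;

    public SegmentLog(String segmentFile) throws IOException {
        this.activeSegment = new FileOutputStream(segmentFile, true); // open in append mode
    }

    public synchronized void append(byte[] message) throws IOException {
        activeSegment.write(message);                  // append to the last segment file
        if (++unflushed >= FLUSH_INTERVAL) {
            activeSegment.flush();                     // only now does the data become visible to consumers
            unflushed = 0;
        }
    }
}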

Unlike traditional messaging systems, a message stored in Kafka has no explicit message ID.

Instead, messages are addressed by their logical offsets in the log. This avoids the overhead of maintaining an auxiliary, seek-intensive random-access index structure that maps message IDs to actual message locations. Message IDs are increasing but not consecutive: to compute the ID of the next message, add the length of the current message to its logical offset.
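
For example (an illustrative calculation of my own, not taken from the source): if a message sits at logical offset 1000 and occupies 250 bytes on disk, the ID of the next message is 1000 + 250 = 1250.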

The consumer always consumes messages from a particular partition sequentially. If the consumer acknowledges a particular message offset, it implies that it has consumed all the messages before that offset. The consumer issues an asynchronous pull request to the broker, along with a byte buffer prepared for the consumed data. Each asynchronous pull request contains the offset of the messages to be consumed. Kafka uses the sendfile API to efficiently deliver bytes from a broker's log segment files to consumers.
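
On the JVM, sendfile is exposed through java.nio.channels.FileChannel.transferTo. The following sketch is my own illustration of the zero-copy idea, not Kafka's code; the segment file, offsets, and socket channel are placeholders:

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

// Illustration of zero-copy transfer from a segment file to a consumer connection.
public class ZeroCopySend {
    public static void sendChunk(String segmentFile, SocketChannel consumer,
                                 long offset, long length) throws IOException {
        try (FileInputStream in = new FileInputStream(segmentFile)) {
            FileChannel channel = in.getChannel();
            long sent = 0;
            while (sent < length) {
                // transferTo delegates to sendfile on platforms that support it.
                sent += channel.transferTo(offset + sent, length - sent, consumer);
            }
        }
    }
}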

Figure 3: Kafka Storage Architecture

Kafka broker

Unlike other messaging systems, the Kafka broker is stateless. This means that the consumer, not the broker, must maintain how much it has consumed; the broker does not track it at all. This design is subtle and contains an innovation in itself:

  • Deleting messages from the broker becomes tricky, because the broker does not know whether consumers have already used them. Kafka solves this problem innovatively by applying a simple time-based SLA as its retention policy: a message is automatically deleted once it has been in the broker longer than a configured retention period.
  • This innovative design has a great side benefit: a consumer can deliberately rewind to an old offset and consume data again. This violates the common contract of a queue, but has proven to be an essential feature for many consumers (see the hedged configuration sketch after this list).
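
As a hedged illustration of both points: the retention window is a broker-side setting (for example, log.retention.hours in 0.8-era configurations), and a high-level consumer can effectively start over from the oldest retained message by using a group id with no committed offsets together with auto.offset.reset. The group id and connect string below are placeholders:

import java.util.Properties;

// Illustrative (0.8-era) consumer settings for re-reading retained messages.
public class ReconsumeConfig {
    public static Properties rewindingConsumerProps() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "reparse-group");      // a group id with no committed offsets...
        props.put("auto.offset.reset", "smallest");  // ...starts from the oldest retained message
        return props;
    }
}
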
ZooKeeper and Kafka

Consider a distributed system with multiple servers, each responsible for storing data and performing operations on that data. Potential examples include a distributed search engine, a distributed build system, or a well-known system such as Apache Hadoop. A common problem in all of these distributed systems is determining which servers are alive and working at any given point in time. Most importantly, how do you do these things reliably in the face of the difficulties of distributed computing, such as network failures, bandwidth limitations, variable-latency connections, security issues, and errors that can occur in any networked environment, perhaps even across multiple data centers? These questions are the concern of Apache ZooKeeper, a fast, highly available, fault-tolerant, distributed coordination service. Using ZooKeeper you can build reliable distributed data structures for group membership, leader election, coordinated workflow, and configuration services, as well as generalized distributed data structures such as locks, queues, barriers, and latches. Many well-known and successful projects depend on ZooKeeper, including HBase, Hadoop 2.0, Solr Cloud, Neo4j, Apache Blur (incubating), and Accumulo.

ZooKeeper is a distributed, hierarchical file system that facilitates loose coupling between its clients and provides eventual consistency. Its znodes are analogous to the files and directories of a traditional file system. It offers basic operations, such as creating, deleting, and checking the existence of a znode, and an event-driven model in which a client can watch for changes to a specific znode, for example when a new child is added to an existing znode. For high availability, ZooKeeper runs multiple servers, collectively called an ensemble. Each server holds an in-memory copy of the distributed file system and serves client read requests.
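
As a minimal sketch (my own example using the standard org.apache.zookeeper client; the connect string, path, and payload are placeholders), the basic operations and the watch mechanism look like this:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Create a znode and watch it for changes.
public class ZkBasics {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                System.out.println("event: " + event.getType() + " on " + event.getPath());
            }
        });
        // Create a persistent znode holding a small payload, if it does not exist yet.
        if (zk.exists("/demo", false) == null) {
            zk.create("/demo", "hello".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // Register a watch: the Watcher above fires when the children of /demo change.
        zk.getChildren("/demo", true);
        Thread.sleep(10000); // keep the session open long enough to observe events
        zk.close();
    }
}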

Figure 4: ZooKeeper Ensemble Architecture

Figure 4 shows a typical ZooKeeper ensemble, with one server acting as the leader and the others as followers. When the ensemble starts, a leader is elected first, and all followers then replicate the leader's state. All write requests are routed through the leader, and changes are broadcast to all followers. This change broadcast is called atomic broadcast.

The purpose of ZooKeeper in Kafka: just as ZooKeeper is used for coordination and facilitation in other distributed systems, Kafka uses it for the same reason. ZooKeeper is used to manage and coordinate the Kafka brokers, and each broker coordinates with the others through ZooKeeper. When a new broker joins the Kafka system or an existing broker fails, the ZooKeeper service notifies the producers and consumers, which then start coordinating their work with the remaining brokers accordingly. The overall Kafka system architecture is shown in Figure 5 below.
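
For example, in 0.8-era Kafka the brokers register themselves under the /brokers/ids path in ZooKeeper, so a short, hedged sketch of discovering the live brokers with the standard ZooKeeper client looks like this (the connect string is a placeholder):

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// List the broker ids currently registered in ZooKeeper (0.8-style layout).
public class ListBrokers {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) { /* session events ignored in this sketch */ }
        });
        List<String> brokerIds = zk.getChildren("/brokers/ids", false);
        System.out.println("live brokers: " + brokerIds);
        zk.close();
    }
}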

Figure 5: overall architecture of the Kafka Distributed System

Comparison between Apache Kafka and other message services

Let's look at two projects that use Apache Kafka and compare it with other messaging services. The two projects are LinkedIn's study and my own project:

LinkedIn's Study

The LinkedIn team conducted an experimental study comparing the performance of Kafka with Apache ActiveMQ v5.4 and RabbitMQ v2.4. They used KahaDB, ActiveMQ's default message persistence library. LinkedIn ran the experiments on two Linux machines, each with 8 cores at 2 GHz, 16 GB of memory, and six disks in a RAID 10 configuration. The two machines were connected by a 1 Gb network link. One machine acted as the broker and the other as the producer or consumer.

Producer Test

The LinkedIn team configured the broker in every system to asynchronously flush messages to its persistence store. For each system, they ran a single producer to publish a total of 10 million messages, each 200 bytes. The Kafka producer sent messages in batch sizes of 1 and 50. ActiveMQ and RabbitMQ do not seem to offer an easy way to batch messages, so LinkedIn assumed a batch size of 1 for them. The results are shown in Figure 6 below:

Figure 6: Result of LinkedIn's producer performance experiment

The main reasons for better Kafka performance include:

  • Kafka sends messages as fast as the broker can handle them, without waiting for acknowledgments from the broker (see the hedged configuration sketch after this list).
  • Kafka has a more efficient storage format. On average, each message had an overhead of 9 bytes in Kafka versus 144 bytes in ActiveMQ. The difference comes from the heavyweight message headers required by JMS and the overhead of maintaining various index structures. LinkedIn observed that one of the busiest threads in ActiveMQ spent most of its time accessing a B-tree to maintain message metadata and state.
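
The behavior described in the first bullet corresponds roughly to the following 0.8-era producer settings. This is a hedged sketch: the property names come from the old producer API, and the values are illustrative rather than LinkedIn's actual test configuration:

import java.util.Properties;

// Illustrative fire-and-forget, batching producer configuration (0.8-era property names).
public class FireAndForgetProducerConfig {
    public static Properties props() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "0");   // do not wait for broker acknowledgment
        props.put("producer.type", "async");       // buffer messages and send them in batches
        props.put("batch.num.messages", "50");     // batch size comparable to the experiment above
        return props;
    }
}
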
Consumer Test

To perform the consumer test, LinkedIn used a single consumer to fetch a total of 10 million messages. All systems were configured so that each pull request prefetched roughly the same amount of data: up to 1,000 messages or a set number of kilobytes. For ActiveMQ and RabbitMQ, LinkedIn set the consumer acknowledgment mode to automatic. The results are shown in Figure 7.

Figure 7: LinkedIn consumer performance experiment results

The main reasons for better Kafka performance include:

  • Kafka has a more efficient storage format, so fewer bytes are transferred from the broker to the consumer.
  • The brokers in ActiveMQ and RabbitMQ had to maintain the delivery state of every message. The LinkedIn team noticed that one of the ActiveMQ threads was busy writing KahaDB pages to disk throughout the test. In contrast, there was no disk write activity on the Kafka broker. Finally, Kafka reduces transfer overhead by using the sendfile API.

Currently, I am working on a project that provides real-time services for quickly and accurately extracting over-the-counter (OTC) market pricing content from messages. It is a very important project, handling financial information for nearly 25 asset classes, including bonds, loans, and asset-backed securities (ABS). The project's raw information sources cover the major financial markets of Europe, North America, Canada, and Latin America. Below are some statistics of the project that show why the solution needs to include an efficient distributed messaging service:

  • The number of messages processed each day exceeds 1,300,000;
  • The number of OTC price parsed every day exceeds 12,000,000;
  • Supports more than 25 types of assets;
  • More than 70,000 independent tickets are resolved every day.

Messages arrive in PDF, Word, Excel, and other formats, and OTC pricing may also need to be extracted from attachments.

Because of the performance limitations of traditional messaging servers, the message queues grew very large when large attachments were being processed, and the project ran into serious problems: the JMS queue had to be restarted several times a day, and restarting a JMS queue could lose all the messages it contained. The project needed a framework that retains messages regardless of the behavior of the parser (the consumer). Kafka's features are very well suited to our project's needs.

Features of the current project:

  1. Fetchmail retrieves remote mail messages, which Procmail then filters and processes, for example dispatching attachment-based messages separately.
  2. Each message is retrieved into a separate file, which is processed (read and deleted) and inserted into the messaging server as a message.
  3. The message content is then fetched from the messaging server's queue for parsing and information extraction.

Example Application

This example application is based on a modified version of the original application I use in my project. I have removed logging and multithreading to keep the sample application as simple as possible. Its purpose is to demonstrate how to use the Kafka producer and consumer APIs. The application includes a producer example (simple producer code demonstrating the Kafka producer API, used to publish messages to a specific topic), a consumer example (simple consumer code demonstrating the Kafka consumer API), and a message-content generation API (an API that generates the message content into a file at a specific path). Figure 8 shows the components and their relationships to the other components in the system.

Figure 8: Architecture of the Sample Application Component

The structure of the sample application resembles that of the example programs in the Kafka source code. The application's source contains a 'src' folder with the Java source code and a 'config' folder with several configuration files and shell scripts for running the sample application. To run the sample application, refer to the ReadMe.md file or the GitHub Wiki page of the project.

The program can be built with Apache Maven and is easy to customize. If you want to modify or customize the sample application code, several Kafka build scripts have been modified and can be used to rebuild the sample application code. A detailed description of how to customize the sample application is available on the project's GitHub Wiki page.

Now let's take a look at the core artifacts of the sample application.

Sample Kafka producer code

/**
 * Instantiates a new Kafka producer.
 *
 * @param topic the topic
 * @param directoryPath the directory path
 */
public KafkaMailProducer(String topic, String directoryPath) {
       props.put("serializer.class", "kafka.serializer.StringEncoder");
       props.put("metadata.broker.list", "localhost:9092");
       producer = new kafka.javaapi.producer.Producer<Integer, String>(new ProducerConfig(props));
       this.topic = topic;
       this.directoryPath = directoryPath;
}

public void run() {
      Path dir = Paths.get(directoryPath);
      try {
           new WatchDir(dir).start();
           new ReadDir(dir).start();
      } catch (IOException e) {
           e.printStackTrace();
      }
}

The code snippet above demonstrates the basic usage of the Kafka producer API: setting producer properties such as the topic to publish to, the serializer class to use, and the broker list. The basic function of this class is to read mail message files from the mail directory and publish each one as a message to the Kafka broker. The directory is monitored with the java.nio.file.WatchService class; as soon as a new mail message is dropped into the directory, it is read and published to the Kafka broker as a message.
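
The WatchDir class belongs to the sample project itself; the sketch below is my own rough illustration of directory monitoring with the Java NIO WatchService (the class name, path, and publishing comment are illustrative assumptions, not the project's actual code):

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

// Minimal directory watcher: print the name of every file created in the watched directory.
public class DirectoryWatchSketch {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : "/tmp/mail");  // illustrative path
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        while (true) {
            WatchKey key = watcher.take();                 // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println("new file: " + event.context());
                // The sample application would read the file here, publish it to the
                // Kafka broker as a message, and then delete it.
            }
            if (!key.reset()) break;                       // directory no longer accessible
        }
    }
}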

Sample Kafka Consumer Code

public KafkaMailConsumer(String topic) {
       consumer = kafka.consumer.Consumer.createJavaConsumerConnector(createConsumerConfig());
       this.topic = topic;
}

/**
 * Creates the consumer config.
 *
 * @return the consumer config
 */
private static ConsumerConfig createConsumerConfig() {
      Properties props = new Properties();
      props.put("zookeeper.connect", KafkaMailProperties.zkConnect);
      props.put("group.id", KafkaMailProperties.groupId);
      props.put("zookeeper.session.timeout.ms", "400");
      props.put("zookeeper.sync.time.ms", "200");
      props.put("auto.commit.interval.ms", "1000");
      return new ConsumerConfig(props);
}

public void run() {
      Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
      topicCountMap.put(topic, new Integer(1));
      Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
      KafkaStream<byte[], byte[]> stream = consumerMap.get(topic).get(0);
      ConsumerIterator<byte[], byte[]> it = stream.iterator();
      while (it.hasNext())
            System.out.println(new String(it.next().message()));
}

The code above demonstrates the basic consumer API. As mentioned earlier, the consumer needs to set up a message stream to consume from. In the run method, we set up the stream and print each received message to the console. In my project, we instead feed the messages into the parsing system to extract OTC pricing.

In our current quality assurance system, Kafka serves as the message server for a proof-of-concept (POC) project, and its overall performance has been better than that of the JMS message service. One feature we are very excited about is the re-consumption of messages, which lets our parsing system re-parse certain messages as needed. Based on these good results with Kafka, we are also planning to use it, rather than Nagios, for log aggregation and analysis.

Summary

Kafka is a novel system for processing large volumes of data. Its pull-based consumption model lets consumers process messages at their own speed. If an exception occurs while processing a message, the consumer can always choose to consume that message again.

About the author

Abhishek Sharma works on natural language processing (NLP), machine learning, and parsing of financial products. He has provided algorithm design and parser development for multiple companies. Abhishek's interests include distributed systems, natural language processing, and big data analysis using machine learning algorithms.

 

 
