Apache Kafka: Next-Generation Distributed Messaging System

Source: Internet
Author: User
Tags: sendfile, zookeeper, rabbitmq

"Http://www.infoq.com/cn/articles/apache-kafka/"
Kafka is a distributed publish-subscribe messaging system.

Kafka is a fast, scalable commit log service that is distributed by design, partitioned, and replicated.

Apache Kafka differs from traditional messaging systems in the following ways:
It is designed as a distributed system that is easy to scale out;
It provides high throughput for both publishing and subscribing;
It supports multiple subscribers and automatically rebalances consumers when one fails;
It persists messages to disk, so it can serve both batch consumption, such as ETL, and real-time applications.


This article will focus on the architecture, features, and characteristics of Apache Kafka to help us understand why Kafka is better than traditional messaging services.
I will compare the features of Kafka with those of the traditional messaging services RabbitMQ and Apache ActiveMQ, and discuss scenarios where Kafka outperforms traditional messaging services. In the last section, we will explore a working example application that demonstrates the use of Kafka as a messaging server. The full source code of this example is available on GitHub; a detailed discussion of it appears in the last section of this document.

Architecture
A topic is a stream of messages of a particular type. A message is a payload of bytes, and a topic is the name of the category or feed to which messages are published.
A producer is any object that can publish messages to a topic.
Published messages are stored on a set of servers called brokers, which together form a Kafka cluster.
A consumer can subscribe to one or more topics and pull data from the brokers to consume the published messages.

Figure 1: Kafka producer, consumer, and broker environment

Producers can choose their preferred serialization method to encode the message content. For efficiency, a producer can send a set of messages in a single publish request. The following code shows how to create a producer and send messages.

Producer Sample Code:
producer = new Producer(...);
message = new Message("test message str".getBytes());
set = new MessageSet(message);
producer.send("topic1", set);

To subscribe to a topic, a consumer first creates one or more message streams for that topic. Messages published to the topic are distributed evenly among these streams. Each message stream provides an iterator interface over the continuously produced messages. The consumer iterates over every message in the stream and processes its payload. Unlike a traditional iterator, a message stream iterator never terminates. If there is currently no message, the iterator blocks until a new message is published to the topic.

Kafka supports both a point-to-point delivery model, in which multiple consumers jointly consume a single copy of the messages in a queue, and a publish-subscribe model, in which multiple consumers each receive their own copy of every message. The following code shows how a consumer uses messages.


Consumer Sample Code:
streams[] = Consumer.createMessageStreams("topic1", 1);
for (message : streams[0]) {
    bytes = message.payload();
    // do something with the bytes
}
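Whether these consumers act as a shared queue or as independent subscribers is controlled by their consumer group: consumers that share a group ID divide a topic's messages among themselves (point-to-point), while consumers in different groups each receive every message (publish-subscribe). The following minimal sketch uses the high-level Java consumer described later in this article; the group names and the local ZooKeeper address are assumptions, not part of the original example.

Consumer group sketch (hypothetical):
import java.util.Properties;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

public class DeliveryModelSketch {
    // Build a high-level consumer connector for a given consumer group.
    static ConsumerConnector connect(String groupId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // assumed local ZooKeeper
        props.put("group.id", groupId);
        return kafka.consumer.Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
    }

    public static void main(String[] args) {
        // Queue semantics: both connectors join the hypothetical group "parsers",
        // so each message of a topic is delivered to only one of them.
        ConsumerConnector parserA = connect("parsers");
        ConsumerConnector parserB = connect("parsers");

        // Publish-subscribe semantics: the separate "auditors" group independently
        // receives its own copy of every message as well.
        ConsumerConnector auditors = connect("auditors");
    }
}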


The overall architecture of Kafka is shown in Figure 2. Because Kafka is distributed by nature, a Kafka cluster typically consists of multiple brokers. To balance the load, a topic is divided into partitions, and each broker stores one or more of those partitions. Multiple producers and consumers can publish and retrieve messages at the same time.

"Kafka Storage"
The Kafka storage layout is simple. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of roughly equal size. Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. The segment file is flushed to disk after a configurable number of messages has been published or after a certain amount of time has elapsed. Once flushed, the messages are exposed to consumers.

Unlike traditional messaging systems, a message stored in Kafka has no explicit message ID.

Instead, each message is addressed by its logical offset in the log. This avoids the overhead of maintaining auxiliary, seek-intensive index structures that map message IDs to actual message locations. Message offsets are increasing but not consecutive. To compute the offset of the next message, add the length of the current message to its logical offset.
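A small worked example with made-up numbers illustrates this offset arithmetic; nothing here is taken from Kafka's implementation.

Offset arithmetic sketch (hypothetical values):
// The message at logical offset 1000 occupies 200 bytes on disk,
// so the next message's offset is simply 1000 + 200 = 1200.
long currentOffset = 1000L;  // offset of the current message within the partition log
long messageLength = 200L;   // total on-disk length of the current message, in bytes
long nextOffset = currentOffset + messageLength;  // 1200; no ID-to-address index is required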

Consumers always retrieve messages from a particular partition sequentially; if a consumer knows the offset of a particular message, it implies that it has consumed all of the messages before it. The consumer issues asynchronous pull requests to the broker and keeps a byte buffer ready for consumption. Each pull request contains the offset of the message at which consumption should start. Kafka uses the sendfile API to efficiently deliver bytes from the broker's log segment files to consumers.
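On the JVM, the zero-copy path behind sendfile is exposed through FileChannel.transferTo. The following is a minimal sketch of sending a slice of a log segment file directly to a socket; it is not Kafka's broker code, and the file name, offset, length, and socket address are placeholders.

Sendfile (zero-copy) sketch:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class SendfileSketch {
    public static void main(String[] args) throws IOException {
        RandomAccessFile segment = new RandomAccessFile("00000000000000000000.log", "r"); // placeholder segment file
        FileChannel fileChannel = segment.getChannel();
        SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999)); // placeholder consumer socket

        long offset = 0L;         // start position requested by the consumer
        long length = 64 * 1024;  // number of bytes to deliver

        // transferTo asks the kernel to copy file bytes straight to the socket (sendfile),
        // avoiding an extra round trip through user-space buffers.
        long transferred = fileChannel.transferTo(offset, length, socket);
        System.out.println("Transferred " + transferred + " bytes");

        fileChannel.close();
        socket.close();
    }
}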

"Kafka Agent"
Unlike other messaging systems, the Kafka broker is stateless: consumers must maintain their own consumption state, and the broker does not track it at all. This design is subtle and contains an innovation in itself.

Deleting messages from the broker becomes tricky, because the broker does not know whether consumers have already consumed them. Kafka solves this problem creatively by applying a simple time-based SLA as its retention policy: a message is automatically deleted once it has been kept in the broker longer than a configured period.
This design has a great benefit: a consumer can deliberately rewind to an old offset and consume data again. This violates a common contract of a queue, but it proves to be an essential feature for many consumers.
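The retention window is a broker-side setting. As an illustration only (property names and defaults vary across Kafka versions, and the values below are examples rather than recommendations), such a policy is typically expressed in the broker configuration:

Retention configuration sketch (illustrative values):
# Delete log segments once they are older than 7 days...
log.retention.hours=168
# ...or, optionally, once a partition's log exceeds a size threshold.
log.retention.bytes=1073741824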

"Zookeeper and Kafka"
Consider a distributed system with multiple servers, each responsible for storing data and performing operations on that data. Potential examples include a distributed search engine, a distributed build system, or a well-known system such as Apache Hadoop. A common problem in all of these distributed systems is determining which servers are alive and working at any point in time. Most importantly, how do you do this reliably in the face of the difficulties of distributed computing, such as network failures, bandwidth limitations, variable-latency connections, security issues, and errors that can occur in any network environment, even across multiple data centers?

These questions are the focus of Apache ZooKeeper: a fast, highly available, fault-tolerant, distributed coordination service. With ZooKeeper you can build reliable, distributed data structures for group membership, leader election, coordinated workflow, and configuration services, as well as generalized distributed data structures such as locks, queues, barriers, and latches. Many well-known and successful projects depend on ZooKeeper, including HBase, Hadoop 2.0, Solr Cloud, Neo4j, Apache Blur (incubating), and Accumulo.

ZooKeeper is a distributed, hierarchical file system that facilitates loose coupling between clients. It provides a consistent view of znodes, which are similar to the files and directories of a traditional file system. It offers basic operations, such as creating and deleting znodes and checking whether a znode exists, and it provides an event-driven model in which a client can watch for changes to a particular znode, for example a new child being added to an existing znode. For high availability, ZooKeeper runs as a set of servers called an ensemble. Each server holds an in-memory copy of the distributed file system and serves client read requests.
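A minimal sketch of these basic znode operations, using the standard org.apache.zookeeper client; the paths and data below are hypothetical, and error handling is omitted.

Znode operations sketch:
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a local ensemble; the watcher logs connection and znode events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                System.out.println("Event: " + event);
            }
        });

        // Create a persistent znode if it does not exist yet.
        if (zk.exists("/demo", false) == null) {
            zk.create("/demo", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Create a child znode carrying some configuration data.
        zk.create("/demo/config", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Check existence and register a watch that fires when the znode changes.
        System.out.println("exists: " + (zk.exists("/demo/config", true) != null));

        // List the children of /demo and watch for new children being added.
        System.out.println("children: " + zk.getChildren("/demo", true));

        zk.close();
    }
}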

Figure 4 shows a typical ZooKeeper ensemble, with one server acting as the leader and the others as followers. When the ensemble starts, a leader is elected, and all followers replicate the leader's state. All write requests are routed through the leader, and changes are broadcast to all followers; this change broadcast is called atomic broadcast.

Use of ZooKeeper in Kafka: Just as ZooKeeper is used for the coordination and facilitation of distributed systems, Kafka uses it for the same reason. ZooKeeper is used to manage and coordinate the Kafka brokers; each Kafka broker coordinates with the other brokers through ZooKeeper. When a new broker is added to the Kafka system, or an existing broker fails, the ZooKeeper service notifies producers and consumers, which then begin coordinating their work with the other brokers accordingly. The overall Kafka system architecture is shown in Figure 5.


"Apache Kafka compared to other messaging services"
Let's look at two projects that use Apache Kafka, to compare it with other messaging services. These two projects are LinkedIn's and my own:

LinkedIn's research

The LinkedIn team ran an experimental study comparing the performance of Kafka with Apache ActiveMQ v5.4 and RabbitMQ v2.4. They used ActiveMQ's default message persistence store, KahaDB. LinkedIn ran the experiments on two Linux machines, each with eight 2 GHz cores, 16 GB of memory, and six disks in RAID 10. The two machines were connected by a 1 Gb network link. One machine acted as the broker, the other as the producer or consumer.

"Producer Testing"

The LinkedIn team configured the broker in all systems to flush messages asynchronously to its persistence store. For each system, they ran a single producer that published a total of 10 million messages, each 200 bytes long. The Kafka producer sent messages in batches of 1 and 50. ActiveMQ and RabbitMQ appeared to offer no easy way to send messages in batches, so LinkedIn assumed a batch size of 1 for them. The results are shown in Figure 6 below:

Figure 6: LinkedIn's producer performance test results

The main reasons for Kafka's performance are:

Kafka does not wait for acknowledgments from the broker; it sends messages as fast as the broker can handle them.
Kafka has a more efficient storage format. On average, each message in Kafka carried an overhead of 9 bytes, versus 144 bytes in ActiveMQ. This is the result of the heavy message headers required by JMS and the overhead of maintaining various index structures.

LinkedIn observed that one of ActiveMQ's busiest threads spent most of its time accessing a B-tree to maintain message metadata and state.
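For reference, this kind of batched publishing with the 0.8-era Java producer used later in this article could look roughly like the sketch below; the topic name, message contents, and broker address are placeholders rather than LinkedIn's benchmark code.

Batched publishing sketch:
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class BatchProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("metadata.broker.list", "localhost:9092"); // placeholder broker address

        Producer<Integer, String> producer = new Producer<Integer, String>(new ProducerConfig(props));

        // Collect 50 messages and hand them to the producer in a single send call.
        List<KeyedMessage<Integer, String>> batch = new ArrayList<KeyedMessage<Integer, String>>();
        for (int i = 0; i < 50; i++) {
            batch.add(new KeyedMessage<Integer, String>("topic1", "message-" + i));
        }
        producer.send(batch);

        producer.close();
    }
}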


"Consumer testing"

For the consumer test, LinkedIn used a single consumer to fetch a total of 10 million messages. They configured all systems so that each pull request would prefetch roughly the same amount of data: up to 1000 messages or about 200 KB. For ActiveMQ and RabbitMQ, LinkedIn set the consumer acknowledgment mode to automatic. The results are shown in Figure 7.

Figure 7: LinkedIn's consumer performance test results

The main reasons for Kafka's performance are:

Kafka has a more efficient storage format; fewer bytes are transferred from the broker to the consumer.
The brokers in both ActiveMQ and RabbitMQ had to maintain the delivery state of every message. The LinkedIn team noticed that one of the ActiveMQ threads was busy writing KahaDB pages to disk throughout the test. In contrast, the Kafka broker performed no disk writes.

Finally, Kafka reduces transport overhead by using the sendfile API.


Currently, I am working on a project that provides real-time services to quickly and accurately extract over-the-counter (OTC) market pricing from messages. This is a very important project, handling financial information for nearly 25 asset classes, including bonds, loans, and ABS (asset-backed securities). The project's raw sources of information cover the major financial markets of Europe, North America, Canada, and Latin America. Here are some statistics about the project that show how important an efficient distributed messaging service is to the solution:

More than 1,300,000 messages processed per day;
More than 12,000,000 OTC prices analyzed per day;
More than 25 asset classes supported;
More than 70,000 distinct notes parsed per day.
Messages arrive as PDFs, Word documents, Excel spreadsheets, and other formats; OTC pricing may also have to be extracted from attachments.

Because of the performance limitations of traditional messaging servers, the message queues grew very large when processing big attachments, and our project ran into serious problems: the JMS queue had to be restarted 2-3 times a day, and restarting it could lose all of the messages in the queue. The project needed a framework that keeps messages regardless of the behavior of the parser (the consumer).

Kafka's features are well suited to the needs of our project.

Features of the current project:

Fetchmail retrieves remote mail messages, which are then filtered and processed by procmail, for example splitting attachment-based messages into separate messages.
Each message is retrieved from a separate file; the file is processed (read and deleted), and its content is inserted into the messaging server as a message.
The message content is then fetched from the messaging server's queue for parsing and information extraction.


Sample App
This sample application is based on a modified version of the original application I used in my project. I have removed the logging and multithreading features to keep the sample application's artifacts as simple as possible. The purpose of the sample application is to show how to use the Kafka producer and consumer APIs. It includes a producer example (simple producer code demonstrating Kafka producer API usage, publishing messages to a specific topic), a consumer example (simple consumer code demonstrating Kafka consumer API usage), and a message-content generation API (an API that generates message content into files under a specific path). Figure 8 shows the components and their relationships to the other components of the system.

Figure 8: Sample Application Component Architecture

The structure of the sample application is similar to the example programs in the Kafka source code. The application's source code contains a Java source folder, 'src', and a 'config' folder with several configuration files and shell scripts for running the sample application. To run the sample application, refer to the Readme.md file or the wiki page on the GitHub site for instructions.

The program can be built with Apache Maven and is easy to customize. Several Kafka build scripts have been modified so that the sample application code can be rebuilt if anyone wants to modify or customize it. A detailed description of how to customize the sample application is available on the project's GitHub wiki page.

Now, let's look at the core artifacts of the sample application.

Kafka Producer code example

/**
 * Instantiates a new Kafka producer.
 *
 * @param topic the topic
 * @param directoryPath the directory path
 */
public KafkaMailProducer(String topic, String directoryPath) {
    props.put("serializer.class", "kafka.serializer.StringEncoder");
    props.put("metadata.broker.list", "localhost:9092");
    producer = new kafka.javaapi.producer.Producer<Integer, String>(new ProducerConfig(props));
    this.topic = topic;
    this.directoryPath = directoryPath;
}

public void run() {
    Path dir = Paths.get(directoryPath);
    try {
        new WatchDir(dir).start();
        new ReadDir(dir).start();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The code snippet above shows the basic usage of the Kafka producer API: setting the producer's properties, such as which topic to publish to, which serializer class to use, and information about the broker. The basic function of this class is to read mail message files from the mail directory and publish them as messages to the Kafka broker. The directory is monitored with the java.nio.file.WatchService class; as soon as a new mail message arrives in the directory, it is read and published as a message to the Kafka broker.
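The WatchDir class itself ships with the sample application on GitHub; the following simplified sketch shows what such a directory-watching loop looks like with java.nio.file.WatchService. The directory path and the handling of new files are placeholders, not the project's actual code.

Directory watching sketch:
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class WatchDirSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path dir = Paths.get("/tmp/mail");  // placeholder mail directory
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take();  // blocks until at least one new file appears
            for (WatchEvent<?> event : key.pollEvents()) {
                Path created = (Path) event.context();  // file name relative to the watched directory
                System.out.println("New mail file: " + dir.resolve(created));
                // In the real application this is where the file would be read and
                // published to the Kafka broker via the producer shown above.
            }
            if (!key.reset()) {
                break;  // the watched directory is no longer accessible
            }
        }
    }
}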

Kafka Consumer code example

public KafkaMailConsumer(String topic) {
    consumer = kafka.consumer.Consumer.createJavaConsumerConnector(createConsumerConfig());
    this.topic = topic;
}

/**
 * Creates the consumer config.
 *
 * @return the consumer config
 */
private static ConsumerConfig createConsumerConfig() {
    Properties props = new Properties();
    props.put("zookeeper.connect", KafkaMailProperties.zkConnect);
    props.put("group.id", KafkaMailProperties.groupId);
    props.put("zookeeper.session.timeout.ms", "400");
    props.put("zookeeper.sync.time.ms", "200");
    props.put("auto.commit.interval.ms", "1000");
    return new ConsumerConfig(props);
}

public void run() {
    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    topicCountMap.put(topic, new Integer(1));
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
    KafkaStream<byte[], byte[]> stream = consumerMap.get(topic).get(0);
    ConsumerIterator<byte[], byte[]> it = stream.iterator();
    while (it.hasNext())
        System.out.println(new String(it.next().message()));
}
The code above demonstrates the basic consumer API. As mentioned earlier, a consumer needs to set up one or more message streams for consumption. In the run method we set up a stream and print the received messages to the console; in my project, we feed them into the parsing system to extract the OTC pricing.

In our current quality-assurance system, we use Kafka as the messaging server for a proof-of-concept (POC) project, and its overall performance is better than that of the JMS messaging service. One feature we are very excited about is message re-consumption, which allows our parsing system to re-parse certain messages according to business requirements. Based on these good results with Kafka, we are also planning to use it, instead of the Nagios system, for log aggregation and analysis.

Summary
Kafka is a novel kind of system for processing large amounts of data. Kafka's pull-based consumption model lets consumers process messages at their own pace. If an exception occurs while processing a message, the consumer can always choose to consume that message again.

