Kafka Learning Summary


Kafka is a distributed messaging system based on the publish-subscribe model. It has the following features.

1. Provides message persistence with constant-time (O(1)) access performance, even as stored data grows large.

2. High throughput: even a single cheap commodity machine can handle hundreds of thousands of messages per second.

3. Supports message partitioning and distributed consumption, while guaranteeing message ordering within each partition.

4. Supports horizontal scaling.

5. Supports offline data processing and real-time data processing.

Kafka Architecture

Kafka's topology involves the following roles:

1. Producer: Message producer.

2. Consumer: Message consumer.

3. Broker: a Kafka cluster consists of one or more servers, each called a broker. Producers send messages to brokers, and consumers consume messages from brokers.

4. Topic: message topic. Each message sent to the Kafka cluster has a topic. Messages of different topics are stored separately on disk, and the messages of one topic may be stored on one or more brokers.

5. Partition: Message partition. Each topic includes one or more partitions.

6. Consumer group: each consumer belongs to a specific group. You can specify a group name for each consumer; if none is specified, the consumer belongs to the default group.
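
To make these roles concrete, here is a minimal sketch using the Kafka Java producer client; the broker address localhost:9092 and the topic name my-topic are placeholder assumptions, not values from this article.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // The producer sends each record to a topic on the brokers; the key
        // determines the partition, so equal keys stay in one ordered partition.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "user-42", "hello"));
        }
    }
}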

Kafka Topology

A Kafka cluster consists of several producers, several consumer groups, one or more brokers, and a ZooKeeper ensemble. Kafka manages cluster configuration through ZooKeeper and rebalances consumers when consumer group membership changes.

Topic & Partition

Logically, a topic can be understood as a queue: every message must specify its topic, that is, the queue into which the message is placed. To improve Kafka's throughput, each topic is physically divided into one or more partitions, and each partition corresponds to a directory on disk that stores the partition's messages and index files.

If you create two topics, topic1 and topic2, with 13 and 19 partitions respectively, on a cluster of 8 nodes, then a total of 32 (13 + 19) partition directories will be created across the cluster.
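
For example, with the default log.dirs=/tmp/kafka-logs, the partition directories for topic1 on one broker might look like this (a hypothetical listing; which partitions land on which broker depends on the assignment):

/tmp/kafka-logs/topic1-0
/tmp/kafka-logs/topic1-5
/tmp/kafka-logs/topic1-12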

Each log file is a sequence of log entries. Each log entry consists of a 4-byte integer message length (whose value is 1 + 4 + n), a 1-byte magic value, a 4-byte CRC checksum, and an n-byte message body. Each message has a unique 64-bit offset within its partition, which identifies the message's storage location. The on-disk storage format of a message is as follows:

message length : 4 bytes (value: 1 + 4 + n)
"magic" value  : 1 byte
crc            : 4 bytes
payload        : n bytes
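
To make this layout concrete, here is a minimal Java sketch that reads one such entry from a ByteBuffer; it follows the format described above and omits CRC verification.

import java.nio.ByteBuffer;

// Reads one log entry laid out as described above:
// 4-byte length (= 1 + 4 + n), 1-byte magic, 4-byte CRC, n-byte payload.
public class LogEntryReader {
    public static byte[] readPayload(ByteBuffer buf) {
        int length = buf.getInt();       // value is 1 + 4 + n
        byte magic = buf.get();          // format version marker
        int crc = buf.getInt();          // checksum over the payload
        int n = length - 1 - 4;          // remaining bytes are the message body
        byte[] payload = new byte[n];
        buf.get(payload);
        return payload;                  // CRC verification omitted in this sketch
    }
}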
This sequence of log entries is not stored as one file; it is divided into multiple segments. Each segment is named after the offset of the first message it contains, with the suffix .kafka. There is also an index file that records the offset range of the log entries in each segment.



A very important contributor to Kafka's high throughput is that messages are written to each partition sequentially, as an append-only log.



Traditional message systems usually delete messages once they have been consumed. Kafka, by contrast, retains consumed messages and provides two deletion policies for them: one based on the age of the messages and one based on the size of the partition file.
For example, we can edit the configuration file $KAFKA_HOME/config/server.properties so that Kafka deletes data older than one week, or deletes old segments once a partition file exceeds 1 GB. The configuration is as follows:

# The minimum age of a log file to be eligible for deletion
log.retention.hours=168
# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824
# The interval at which log segments are checked to see if they can be deleted according to the retention policies
log.retention.check.interval.ms=300000
# If log.cleaner.enable=true is set the cleaner will be enabled and individual logs can then be marked for log compaction.
log.cleaner.enable=false

The offset of consumed messages is controlled by the consumer, so from Kafka's point of view consumption is stateless: the broker does not need to track which messages have been consumed, nor to guarantee on its own that a message is consumed by only one consumer of a consumer group. No lock mechanism is required, which is another important contributor to Kafka's high throughput.
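
Because the consumer owns its offset, it can rewind or skip at will. Below is a minimal sketch using the modern Java consumer API (introduced after the 0.8.x era this article describes); the broker address, topic, partition, and offset are placeholder assumptions.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetControl {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singletonList(tp));
            // The consumer, not the broker, decides where to read next:
            consumer.seek(tp, 42L); // rewind or skip to any offset
        }
    }
}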

Push & Pull

In Kafka, producers push messages to the broker, while consumers pull messages from the broker. Both push and pull have advantages and disadvantages; Kafka's choice of pull for consumption simplifies the broker's design. With push, the broker delivers messages as fast as possible, which may overwhelm a consumer that cannot keep up and cause network congestion or denial of service. With pull, the consumer controls when to consume: it can fetch data in batches or one message at a time, and it can choose different commit strategies to implement different delivery semantics, as the sketch below illustrates.
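
A minimal pull-loop sketch using the modern Java consumer API is shown below; the consumer, not the broker, decides when each batch is fetched. The broker address, group name, and topic are placeholder assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PullLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");              // consumer group name
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                // Each poll() pulls a batch of records from the broker;
                // the consumer decides how often to fetch and how long to wait.
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : batch) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}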

The delivery guarantee semantics for messages:
  • At most once: messages may be lost, but are never redelivered.
  • At least once: messages are never lost, but may be redelivered.
  • Exactly once: each message is delivered once and only once. In many cases, this is what the user wants.

When the producer sends a message to the broker, once the message is committed it will not be lost, thanks to replication. However, if communication is interrupted by a network problem after the producer sends the data, the producer cannot tell whether the message was committed. Although Kafka cannot determine what happened during a network failure, the producer could attach something like a primary key to each message so that retries after a failure become idempotent, which would achieve exactly once. As of Kafka 0.8.2, this feature has not been implemented; it is expected in a future Kafka version. (Therefore, by default, producer-to-broker delivery guarantees at least once; at most once can be implemented by using the producer's asynchronous send.)
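
As an illustration of this producer-side trade-off, here is a hedged sketch with the standard Java producer client: enabling retries yields at least once (a retried send may be written twice), while the record key can serve as the application-level "primary key" mentioned above for downstream deduplication. Kafka 0.8.2 itself provides no such dedup mechanism; the address and topic are assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeliverySemantics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // At least once: wait for replica acknowledgment and retry on
        // transient errors. A retried send may be written twice.
        props.put("acks", "all");
        props.put("retries", 3);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record key acts like an application-level primary key,
            // letting a downstream consumer deduplicate retried messages.
            producer.send(new ProducerRecord<>("my-topic", "order-1001", "payload"));
        }
    }
}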

On the consumer side, after a message is read from the broker, the delivery semantics depend on when the consumer commits the offset:

  • Read the message, commit the offset, and then process the message. In this mode, if the consumer crashes after the commit but before processing, then after a restart it cannot re-read the message that was committed but never processed. This corresponds to at most once (see the sketch after this list).

  • Read the message, process it, and then commit the offset. In this mode, if the consumer crashes after processing the message but before the commit, then after a restart it will process the uncommitted message again, even though it has already been processed. This corresponds to at least once. In many use cases messages carry a primary key, so processing is often idempotent (processing a message multiple times is equivalent to processing it once), and this can be regarded as exactly once. (I think this claim is far-fetched: idempotence is not a mechanism provided by Kafka itself, and a primary key alone cannot fully guarantee that an operation is idempotent. Delivery guarantee semantics is about how many times a message is processed, not what the processing result looks like; we should not treat properties of the processing logic, such as idempotence, as features of Kafka itself.)

  • To achieve exactly once, the offset must be coordinated with the output of the actual processing. The classic approach is to introduce a two-phase commit. But it is simpler and more general to store the offset and the processing output in the same place, and this may also work better, because many output systems do not support two-phase commit. For example, if the consumer writes its data to HDFS and stores the latest offset together with the data itself, then the data output and the offset update are guaranteed to either both complete or both fail, which indirectly achieves exactly once. (Currently, with the high-level API the offset is stored in ZooKeeper and cannot be stored in HDFS, while with the low-level API the offset is maintained by the consumer itself and can be stored in HDFS.)
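
The two commit orderings from the list above can be contrasted in code. The sketch below assumes a KafkaConsumer created with enable.auto.commit=false, so that offsets are saved only by the explicit commitSync() calls; it illustrates the semantics rather than being a complete program.

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitOrdering {
    // At most once: commit first, then process. A crash after the commit
    // but before processing loses the batch.
    static void atMostOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
        consumer.commitSync();                   // offset saved before any work
        batch.forEach(CommitOrdering::process);  // crash here => messages lost
    }

    // At least once: process first, then commit. A crash before the commit
    // replays the batch, so processing should be idempotent.
    static void atLeastOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
        batch.forEach(CommitOrdering::process);  // crash here => reprocessed later
        consumer.commitSync();                   // offset saved only after the work
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}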

In short, Kafka guarantees at least once delivery by default, and allows at most once by using the producer's asynchronous send. Exactly once requires cooperation with an external storage system; fortunately, the offsets Kafka exposes make this cooperation straightforward.
