Talking about distributed message technology Kafka (reposted)


Basic introduction to Kafka

Kafka, originally developed by LinkedIn, is a distributed, partitioned, multi-replica, multi-subscriber log system coordinated by ZooKeeper (it is also commonly used as an MQ system). It can be used for web/nginx logs, access logs, messaging services, and so on. LinkedIn later contributed it to the Apache Foundation, where it became a top-level open source project.

Its main application scenarios are log collection systems and messaging systems.

Kafka's main design objectives are as follows:

    • Provide message persistence with O(1) time complexity, guaranteeing constant-time access performance even for terabytes of data.

    • High throughput: even on very inexpensive commodity machines, a single broker can support 100K messages per second.

    • Support message partitioning across Kafka servers and distributed consumption, while guaranteeing message ordering within each partition.

    • Support both offline data processing and real-time data processing.

Analysis of Kafka's design principles

A typical Kafka cluster contains several producers, several brokers, several consumers, and a ZooKeeper cluster. Kafka manages the cluster configuration through ZooKeeper, elects leaders, and rebalances when consumer group membership changes. Producers publish messages to brokers in push mode; consumers subscribe to and consume messages from brokers in pull mode.

Kafka terminology:

    • Broker: a message-middleware processing node. A Kafka node is a broker, and multiple brokers can form a Kafka cluster.

    • Topic: a class of messages. A Kafka cluster can distribute multiple topics simultaneously.

    • Partition: a physical grouping of a topic. A topic can be divided into multiple partitions, each of which is an ordered queue.

    • Segment: a partition is physically composed of multiple segments.

    • Offset: each partition consists of a series of ordered, immutable messages that are continuously appended to it. Each message in a partition has a sequential ID number called the offset, which uniquely identifies the message within the partition.

    • Producer: responsible for publishing messages to a Kafka broker.

    • Consumer: the message consumer; the client that reads messages from a Kafka broker.

    • Consumer Group: each consumer belongs to a specific consumer group (a minimal producer/consumer sketch follows this list).
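
To make these terms concrete, here is a minimal sketch using the standard Java kafka-clients API. The broker address, topic name, and group id are placeholders, not anything prescribed by Kafka:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class HelloKafka {
        public static void main(String[] args) {
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");   // a broker in the cluster
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // Producer: publishes a message to a topic; Kafka appends it to one partition.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello"));
            }

            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "demo-group");                // the consumer group this consumer belongs to
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("auto.offset.reset", "earliest");
            // Consumer: reads messages from the broker; its position is tracked by the offset.
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(Collections.singletonList("demo-topic"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }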

Transactional characteristics of Kafka data transmission

    • At most once: a message is delivered at most once. This is similar to a "non-persistent" message in JMS: it is sent once and, whether it succeeds or fails, it is never re-sent. The consumer fetches a message, saves the offset, and then processes the message. If the client saves the offset but an exception occurs during processing, the offset has already advanced past the failed messages, so the "unprocessed" messages will never be fetched again; hence "at most once".

    • At least once: a message is delivered at least once; if it is not received successfully, it may be re-sent until it is. The consumer fetches a message, processes it, and then saves the offset. If the message is processed successfully but a ZooKeeper exception during the save-offset phase causes the save to fail, the next fetch will deliver the last processed message again; hence "at least once". The cause is that the offset was not submitted to ZooKeeper in time, so when ZooKeeper recovers it still holds the previous offset state.

    • Exactly once: a message is delivered exactly once. Kafka does not strictly implement this (it would require something like two-phase commit), and we think this guarantee is not necessary in Kafka.

Usually, "at least once" is our first choice. A consumer-side sketch of this semantic follows.

Kafka message storage format

Topic & Partition

A topic can be considered a class of messages. Each topic is divided into multiple partitions, and each partition is, at the storage level, an append-only log file.

In Kafka's file store, a topic contains several different partitions, and each partition is a directory. The partition naming rule is the topic name plus an ordered sequence number: the first partition's number starts at 0, and the largest number is the number of partitions minus 1.

    • Each partition (directory) is equivalent to one huge file that is evenly divided into multiple equally sized segment data files. However, the number of messages per segment file is not necessarily equal; this property makes it easy to delete old segment files quickly.

    • Each partition only needs to support sequential reads and writes; a segment file's lifecycle is determined by server configuration parameters.

The advantage of this is that useless files can be deleted quickly, effectively improving disk utilization.

    • Segment file composition: a segment consists of two major parts, an index file and a data file. These two files correspond one to one and appear in pairs; the suffixes ".index" and ".log" denote the segment index file and the segment data file, respectively.

    • Segment file naming rules: the first segment of a partition starts from 0, and each subsequent segment file is named after the offset of the last message in the previous segment file. The value is a 64-bit long, at most 19 decimal digits, padded with leading zeros.

The physical relationship between the index file and the data file within a segment is as follows:

The index file stores a large amount of metadata, while the data file stores the messages themselves; each metadata entry in the index file points to the physical offset of a message in the corresponding data file.

Take the metadata entry (3, 497) in the index file as an example: it denotes the 3rd message in this data file (message number 368772 globally in the partition, given a segment base offset of 368769), whose physical offset within the data file is 497. A lookup sketch follows.
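
As an illustration of how this layout supports lookup by offset (a sketch of the scheme described above, not Kafka's actual code; the base offsets are illustrative): first find the segment whose base offset is the largest one not exceeding the target, then use the relative offset within that segment's sparse index.

    import java.util.Map;
    import java.util.TreeMap;

    public class SegmentLookup {
        public static void main(String[] args) {
            // Illustrative segment base offsets -> file names, using the
            // 19-digit zero-padded naming rule described above.
            TreeMap<Long, String> segments = new TreeMap<>();
            for (long base : new long[]{0L, 368769L, 737337L}) {
                segments.put(base, String.format("%019d", base) + ".log");
            }
            long target = 368772L;
            // Step 1: the segment holding `target` is the one with the largest base <= target.
            Map.Entry<Long, String> seg = segments.floorEntry(target);
            // Step 2: the relative offset (here 3) is what the sparse .index file is keyed by;
            // its entry (3, 497) would then give byte position 497 inside the .log file.
            long relative = target - seg.getKey();
            System.out.println("segment=" + seg.getValue() + "  relative offset=" + relative);
        }
    }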

Since a segment data file consists of many messages, the physical structure of a single on-disk message is detailed below:

Parameter description:

    • 8 byte offset: every message within a partition has an ordered ID number, the offset, that uniquely determines its position within the partition. In other words, the offset is the message's number within the partition.

    • 4 byte message size: the size of the message.

    • 4 byte CRC32: a CRC32 checksum used to verify the message.

    • 1 byte "magic": the version number of the Kafka service protocol for this release.

    • 1 byte "attributes": metadata such as whether the message is standalone, its compression type, or its encoding type.

    • 4 byte key length: the length of the key; when the key length is -1, the K-byte key field is omitted.

    • K byte key: optional.

    • value bytes payload: the actual message data.
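
A sketch that encodes one entry in the on-disk layout described by the table above. This mirrors the legacy message format as described here, not the current Kafka record format; field order and sizes follow the table:

    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    public class MessageLayout {
        // Encodes one log entry: 8-byte offset, 4-byte size, then the message body
        // (CRC32, magic, attributes, key length, key, payload), per the table above.
        static ByteBuffer encode(long offset, byte[] key, byte[] payload) {
            int keyLen = (key == null) ? -1 : key.length;
            int bodySize = 4 + 1 + 1 + 4 + Math.max(keyLen, 0) + payload.length;
            ByteBuffer body = ByteBuffer.allocate(bodySize);
            body.position(4);                    // reserve room for the CRC32 field
            body.put((byte) 0);                  // "magic": protocol version number
            body.put((byte) 0);                  // "attributes": 0 = no compression
            body.putInt(keyLen);                 // key length; -1 means "no key"
            if (key != null) body.put(key);
            body.put(payload);                   // the actual message data
            CRC32 crc = new CRC32();
            crc.update(body.array(), 4, bodySize - 4);   // checksum everything after the CRC
            body.putInt(0, (int) crc.getValue());
            ByteBuffer entry = ByteBuffer.allocate(8 + 4 + bodySize);
            entry.putLong(offset);               // 8-byte offset within the partition
            entry.putInt(bodySize);              // 4-byte message size
            entry.put(body.array());
            entry.flip();
            return entry;
        }

        public static void main(String[] args) {
            ByteBuffer e = encode(368772L, "key".getBytes(), "hello".getBytes());
            System.out.println("encoded entry: " + e.remaining() + " bytes");
        }
    }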

Replication policy

Kafka's high-reliability guarantee comes from its robust replication strategy.

1) Data synchronization

Before the 0.8 release, Kafka provided no partition replication mechanism: once a broker went down, none of the partitions on it could serve requests, and since partitions had no backup data, data availability was greatly reduced. So from 0.8 onward, a replication mechanism is provided to guarantee broker failover.

With replication introduced, the same partition may have multiple replicas, and a leader must be elected among them. Producers and consumers interact only with this leader, while the other replicas copy data from the leader as followers.

2) Replica placement policy

For better load balancing, Kafka tries to distribute all partitions evenly across the cluster.

Kafka's algorithm for assigning replicas is as follows:

    • Sort all n surviving brokers and the partitions to be allocated.

    • Assign the i-th partition to broker (i mod n); the first replica of this partition lives on this broker and serves as the partition's preferred replica.

    • Assign the j-th replica of partition i to broker ((i + j) mod n).

Suppose a cluster has 4 brokers and a topic has 4 partitions, each with 3 replicas. The replica assignment on each broker then follows directly from the rules above, as the sketch below illustrates.
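
A minimal sketch of the modular assignment for that example (the real controller also staggers the starting broker; this simplification follows the two rules above verbatim):

    public class ReplicaAssignment {
        public static void main(String[] args) {
            int brokers = 4, partitions = 4, replicas = 3;
            // Replica j of partition i goes to broker (i + j) mod n, per the rules above.
            for (int i = 0; i < partitions; i++) {
                StringBuilder sb = new StringBuilder("partition " + i + " -> brokers [");
                for (int j = 0; j < replicas; j++) {
                    sb.append((i + j) % brokers).append(j < replicas - 1 ? ", " : "]");
                }
                System.out.println(sb);
            }
        }
    }

For 4 brokers this prints [0, 1, 2], [1, 2, 3], [2, 3, 0], and [3, 0, 1]: each broker ends up holding 3 replicas, and each broker is the preferred replica of exactly one partition.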

3) Synchronization policy

When a producer publishes a message to a partition, it first finds the partition's leader through ZooKeeper; then, no matter what the topic's replication factor is, the producer sends the message only to that leader. The leader writes the message to its local log, and each follower pulls data from the leader, so the order of the data a follower stores is consistent with the leader's. A follower sends an ACK to the leader once it has received the message and written it to its log. When the leader has received ACKs from all replicas in the ISR, the message is considered committed; the leader then advances the HW (high watermark) and sends an ACK to the producer.

To improve performance, each follower sends its ACK to the leader immediately after receiving the data, rather than waiting until the data is written to its log. Therefore, for a committed message, Kafka can only guarantee that it is stored in the memory of multiple replicas, not that it has been persisted to disk, so it cannot fully guarantee that the message will still be consumable after an exception occurs.

Consumers also read messages from the leader, and only messages that have been committed are exposed to consumers.
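
On the producer side, whether to wait for this commit is configurable. Reusing the producer setup from the first sketch, the relevant fragment would be (these are real Java-client config keys; the values are illustrative):

    // acks=all: the send is acknowledged only after the leader has received
    // ACKs from every replica in the ISR, i.e. the message is committed.
    props.put("acks", "all");
    // acks=1 would return after the leader's own write; acks=0 would not wait at all.
    props.put("retries", 3);   // retry transient failures (may cause duplicates)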

(Figure omitted: the data flow for Kafka replication.)

For Kafka, the definition of whether a broker is "alive" has two conditions:

    • First, it must maintain its session with ZooKeeper (via ZooKeeper's heartbeat mechanism).

    • Second, a follower must be able to copy the leader's messages in time and not "lag too far behind".

The leader keeps track of the list of replicas that stay in sync with it, called the ISR (In-Sync Replicas). If a follower goes down or falls too far behind, the leader removes it from the ISR. "Lagging too far behind" here means either that the number of messages the follower has copied falls behind the leader by more than a threshold, or that the follower has not sent a fetch request to the leader for longer than a certain time.

Kafka only handles fail/recover. A message is not considered committed until all followers in the ISR have copied it from the leader. This prevents the case where data written only to the leader is lost because the leader goes down before any follower could copy it (in which case consumers could never consume the data). A producer can choose whether to wait for a message to commit. This mechanism ensures that as long as the ISR has one or more followers, a committed message is not lost.

4) Leader election

Leader election is essentially a distributed lock, and there are two ways to implement a ZooKeeper-based distributed lock:

    • Node name uniqueness: multiple clients try to create the same node, and only the client that creates it successfully obtains the lock.

    • Ephemeral sequential nodes: all clients create their own ephemeral sequential nodes under a directory, and only the client with the smallest sequence number obtains the lock.

The Majority Vote election strategy is similar to the Zab election inside ZooKeeper; in fact, ZooKeeper itself implements such a majority-rule voting strategy. Kafka elects the leader replica of a partition in the first way: a ZNode ephemeral node is designated for the partition, and the replica that successfully creates the node first becomes the leader. The other replicas register watchers on this ZNode; once the leader goes down, the corresponding ephemeral node is deleted automatically, all followers registered on that node receive the watcher event, and each attempts to create the node. Only the follower that creates it successfully becomes the leader (ZooKeeper guarantees that only one client can successfully create a given node), and the other followers re-register their watchers. A sketch of this ephemeral-node pattern follows.
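
A minimal sketch of the ephemeral-node election pattern described above, using the standard ZooKeeper Java client. This illustrates the pattern, not Kafka's actual controller code; the znode path is hypothetical:

    import org.apache.zookeeper.*;

    public class LeaderElector implements Watcher {
        // Hypothetical path; one znode per partition whose leader is being elected.
        private static final String PATH = "/demo/partitions/0/leader";
        private final ZooKeeper zk;
        private final byte[] myId;

        LeaderElector(ZooKeeper zk, byte[] myId) {
            this.zk = zk;
            this.myId = myId;
        }

        void tryToLead() throws Exception {
            try {
                // EPHEMERAL: the node vanishes automatically when the creator's session ends.
                zk.create(PATH, myId, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                System.out.println("became leader");
            } catch (KeeperException.NodeExistsException e) {
                // Someone else is leader; watch the node so we re-run the
                // election the moment it disappears.
                zk.exists(PATH, this);
            }
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDeleted) {
                try { tryToLead(); } catch (Exception ignored) { }
            }
        }
    }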

Kafka message grouping and message consumption principles

A message from the same topic can only be consumed by one consumer within a given consumer group, but multiple consumer groups can consume the message at the same time.

This is the means by which Kafka implements both broadcast (to all consumers) and unicast (to a single consumer) of a topic's messages. A topic can correspond to multiple consumer groups. To implement broadcast, give each consumer its own separate group; to implement unicast, put all the consumers in the same group. Consumer groups also let consumers be grouped freely without sending messages multiple times to different topics. The snippet below sketches the group-id choice for both modes.
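
Concretely, with the Java client only the group.id differs between the two modes (reusing the consumer setup from the earlier sketches; the group names are placeholders):

    // Unicast (queue semantics): all consumers share ONE group id, so each
    // message is handled by exactly one member of the group.
    props.put("group.id", "order-processors");

    // Broadcast (pub/sub semantics): every consumer uses its OWN group id,
    // so each consumer receives every message.
    props.put("group.id", "audit-" + java.util.UUID.randomUUID());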

Push vs. pull

As a messaging system, Kafka follows the traditional design: producers push messages to the broker, and consumers pull messages from the broker.

Push mode has difficulty adapting to consumers with different consumption rates, because the sending rate is determined by the broker. The goal of push mode is to deliver messages as fast as possible, but this easily overwhelms consumers, typically showing up as denial of service and network congestion. Pull mode lets messages be consumed at a suitable rate according to the consumer's capacity.

For Kafka, pull mode is more appropriate. Pull mode simplifies the broker's design; the consumer can autonomously control the rate at which it consumes messages, and it can also control how it consumes: in batches or one by one, and with different commit methods to realize different delivery semantics. The fetch settings below sketch this consumer-side control.
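
With the Java client, this control shows up directly in the consumer's fetch settings (real config keys; the values are illustrative and would extend the earlier consumer sketch):

    // Pull in batches: cap how many records a single poll() returns.
    props.put("max.poll.records", 500);
    // Don't return until at least 1 KB is available or 500 ms have passed;
    // the consumer, not the broker, decides how eagerly to pull.
    props.put("fetch.min.bytes", 1024);
    props.put("fetch.max.wait.ms", 500);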

Kafka sequential writes and data reads

The producer submits data to Kafka, and Kafka writes every received message to the hard disk, so it never loses data. To optimize write speed, Kafka employs two techniques: sequential writes and MMAP (memory-mapped files).

Sequential Write

Because the hard disk has a mechanical structure, every read or write involves seeking and then transferring, and seeking is a "mechanical action" that costs the most time. So hard disks "hate" random I/O most and "like" sequential I/O best. To improve hard-disk read/write speed, Kafka uses sequential I/O.

Each message is appended to its partition, which is a sequential disk write, and is therefore highly efficient.

A traditional message queue usually deletes messages that have already been consumed. Kafka does not delete data; it keeps all of it, and each consumer maintains an offset per partition to indicate how many messages it has read.

Even with sequential writes, hard-disk access speed cannot catch up with memory. So Kafka's data is not written to the hard disk in real time; instead it takes full advantage of the modern operating system's paged storage (the page cache) to use memory to improve I/O efficiency.

After Linux kernel 2.2, a system-call mechanism called "zero copy" appeared. It skips the copy into the "user buffer" and establishes a direct mapping between disk space and memory space, so data is no longer copied into a user-mode buffer. This removes 2 context switches between user and kernel mode and can roughly double performance.

Through MMAP, a process reads and writes the hard disk as if it were memory (virtual memory, of course). This yields a large I/O improvement by eliminating the overhead of copying from user space to kernel space (an ordinary file read first pulls the data into kernel-space memory and only then copies it into user-space memory). A memory-mapping sketch follows.
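
In Java, this page-cache-backed I/O is available as memory-mapped files, the technique Kafka applies to its index files. A minimal sketch (the file name and sizes are placeholders):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MmapDemo {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile("demo.index", "rw");
                 FileChannel ch = file.getChannel()) {
                // Map 4 KB of the file; writes land in the page cache and the OS
                // flushes them, so there is no separate user-space buffer copy.
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                buf.putInt(3);      // e.g. a sparse index entry: relative offset...
                buf.putInt(497);    // ...and the physical position in the .log file
                buf.force();        // ask the OS to flush dirty pages now (optional)
            }
        }
    }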

Consumer (reading data)

Imagine a web server delivering a static file: how do you optimize it? The answer is zero copy. In the traditional mode, reading a file from the hard drive and sending it out works like this:

First the data is copied into kernel space (read is a system call and goes through the DMA buffer, hence kernel space), then copied into user space (steps 1 and 2); then it is copied from user space back into kernel space (your socket write is a system call, so it too has its own kernel buffer), and finally sent to the network card (steps 3 and 4).

Zero copy goes directly from kernel space (the DMA buffer) to kernel space (the socket buffer), and then to the network card. This technique is very common; nginx uses it too.

In fact, Kafka stores all the messages in files, and when a consumer needs data, Kafka sends the "file" to the consumer directly. When the entire file does not need to be sent, Kafka calls the zero-copy sendfile function, whose parameters include (a Java sketch follows the list):

    • out_fd: the output descriptor (in practice, the socket handle).

    • in_fd: the input file handle.

    • off_t: the offset into in_fd (where to start reading).

    • size_t: the number of bytes to read.
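
In the Java world, sendfile surfaces as FileChannel.transferTo, which is how Kafka ships log segment bytes to the socket. A minimal sketch (host, port, file name, and byte range are placeholders):

    import java.io.FileInputStream;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    public class SendfileDemo {
        public static void main(String[] args) throws Exception {
            try (FileChannel in = new FileInputStream("0000000000000368769.log").getChannel();
                 SocketChannel out = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
                long offset = 0;            // off_t: where to start reading in the file
                long count = in.size();     // size_t: how many bytes to send
                // On Linux, transferTo delegates to sendfile(2): the bytes move from
                // the page cache straight to the socket, never entering user space.
                in.transferTo(offset, count, out);
            }
        }
    }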

"Talking about the technical points in large-scale distributed Systems" series articles:

    • Talking about distributed transaction

    • Discussion on distributed service coordination Technology Zookeeper



Reprinted from linkedkeeper.com.
