Deep analysis of replication function in Kafka cluster

Kafka is a distributed publish-subscribe messaging system. It was developed at LinkedIn, entered the Apache Incubator in July 2011, and is now a top-level Apache project. Kafka is widely used by companies such as LinkedIn and Twitter, mainly for log aggregation, message queuing, real-time monitoring, and so on.

Starting with version 0.8, Kafka supports intra-cluster replication for increased availability and system stability. This article outlines the design of Kafka replication.

Replication
With replication, Kafka clients gain the following benefits:

Producers can continue to publish messages in the event of a failure, and can choose between latency and durability depending on the application.
Consumers continue to receive correct messages in real time even when a failure occurs.

All distributed systems must make trade-offs among consistency, availability, and partition tolerance (see the CAP theorem). Kafka's goal is to support replication within a Kafka cluster in a single data center, where network partitions are relatively rare, so the design focuses on high availability and strong consistency. Strong consistency means that all replicas hold exactly the same data, which simplifies the work of application developers.
Roughly speaking, Kafka behaves as a CA system (debatable), ZooKeeper is a CP system (fairly certain), and Eureka is an AP system (clearly so).

Strongly consistent replication
There are two typical approaches to maintaining strongly consistent replication in existing, mature systems. Both require that one replica be designated as the leader and that all writes go to that replica. The leader handles all writes, broadcasts them to the follower replicas, and ensures that the followers apply them in the same order as the leader.

The first approach is quorum-based: the leader waits until a majority of replicas have received the data. If the leader fails, a majority of the followers coordinate to elect a new leader. This approach is used by Apache ZooKeeper and Google's Spanner.
The second approach is for the leader to wait until all replicas have received the data (important note: in Kafka, "all" means all in-sync replicas). If the leader fails, any of the remaining replicas can be elected as the new leader.

Kafka replication chooses the second approach, for two main reasons:

The second approach tolerates more failures with the same number of replicas. For example, with 2n+1 replicas in total, the second approach can tolerate 2n replica failures (as long as at least one in-sync replica can still accept writes), while the first approach tolerates only n. With only two replicas, the first approach cannot tolerate even a single replica failure.

The first approach has better latency because it only needs acknowledgements from a quorum, so the effect of a few slow replicas is hidden. However, Kafka replication is designed for a cluster within a single data center, where the variance in network latency is relatively small.

Terms
To understand how replication is implemented in Kafka, we first need to introduce some basic concepts. In Kafka, a message stream is defined by a topic; a topic is divided into one or more partitions; replication happens at the partition level; and each partition has one or more replicas.

Replicas are spread evenly across the servers (called brokers) in the Kafka cluster. Each replica maintains a log on disk. Messages published by producers are appended to the log in order, and each message in the log is identified by a monotonically increasing offset.

The offset is a logical concept within a partition: given an offset, the same message can be identified in every replica of the partition. When a consumer subscribes to a topic, it keeps track of the offset it has consumed in each partition and uses that offset to request messages from the broker.
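
As a rough illustration, the following Java snippet (a minimal sketch; the broker address, topic name, and group id are placeholders, not from the original article) uses the standard Kafka consumer client to print the partition and offset of each record it receives:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class OffsetExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo");             // placeholder group id
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("topic1")); // placeholder topic
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Each record is identified by its partition and a monotonically increasing offset.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }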

Design
The goal of adding replicas in Kafka is to provide stronger durability and higher availability: Kafka ensures that any successfully published message is not lost and can be consumed, even when some servers fail. The main objectives of Kafka replication are:

Configurable durability guarantees: applications that cannot tolerate any data loss can choose stronger durability, which naturally comes with higher latency; applications that generate huge volumes of data and can tolerate a small amount of loss can choose weaker durability in exchange for better write response time and throughput.

Automated replica management: Kafka aims to simplify the assignment of replicas to brokers and to support gradual expansion and shrinking of the cluster.

In this case, there are two main issues that need to be addressed:

How are a partition's replicas assigned evenly across the brokers? (A simplified sketch follows after this list.)
For a given partition, how is each message propagated to all of its replicas?
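
For intuition on the first question, the sketch below spreads replicas round-robin with a per-partition shift so that leaders and followers land on different brokers. This is only a simplified model; Kafka's actual assignment logic (including rack awareness) differs in the details.

    import java.util.ArrayList;
    import java.util.List;

    // Simplified replica assignment: partition p gets replicas on brokers
    // (p, p+1, ..., p+replicationFactor-1) modulo the broker count.
    // Illustrative sketch only, not Kafka's exact algorithm.
    public class AssignmentSketch {
        static List<List<Integer>> assign(int numBrokers, int numPartitions, int replicationFactor) {
            List<List<Integer>> assignment = new ArrayList<>();
            for (int p = 0; p < numPartitions; p++) {
                List<Integer> replicas = new ArrayList<>();
                for (int r = 0; r < replicationFactor; r++) {
                    replicas.add((p + r) % numBrokers); // first entry acts as the preferred leader
                }
                assignment.add(replicas);
            }
            return assignment;
        }

        public static void main(String[] args) {
            // e.g. 4 brokers, 2 partitions, replication factor 3 -> [[0, 1, 2], [1, 2, 3]]
            System.out.println(assign(4, 2, 3));
        }
    }
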
Data replication
Kafka lets the client choose between asynchronous and synchronous replication. With asynchronous replication, a published message is acknowledged as soon as one replica (the leader) has received it. With synchronous replication, Kafka does its best to ensure the message has reached multiple replicas (the current ISR) before acknowledging it. When a client publishes a message to a topic partition, Kafka must propagate the message to all replicas, and it must decide:

how to propagate the message;
how many replicas must receive the message before it is acknowledged to the client;
what to do when a replica fails;
what to do when a failed replica recovers.
Approaches
There are two common strategies for keeping replicas in sync: primary-backup replication and quorum-based replication. In both cases, one replica is designated as the leader and the remaining replicas are followers; all write requests go to the leader, and the leader propagates the writes to the followers.

Under primary-backup replication, the leader waits until the write has completed on every replica in the group before sending an acknowledgement to the client. If a replica fails, the leader removes it from the group and continues writing to the remaining replicas. A failed replica is allowed to rejoin the group once it recovers and catches up with the leader. With n replicas, primary-backup replication can tolerate n-1 replica failures.

Under the quorum-based approach, the leader waits until the write has completed on a majority of the replicas, and the size of the replica group does not change when some replicas fail (for example, if a partition has 5 replicas and 2 of them fail, the group is still considered to have 5 replicas). Therefore, with 2n+1 replicas, quorum-based replication can tolerate only n replica failures, and if the leader fails, at least n+1 replicas are needed to elect a new leader.

The two approaches involve the following trade-offs:

The quorum-based approach has better write latency than primary-backup: a delay on any single replica (for example, a long stop-the-world GC pause) increases write latency under the primary-backup approach but not under the quorum approach.
With the same number of replicas, the primary-backup approach tolerates more failures.
Under the primary-backup approach, a replication factor of 2 already works well; under quorum-based replication, by contrast, both replicas must stay up for the system to keep working.

Kafka chooses primary-backup replication because it tolerates more replica failures and works correctly with just 2 replicas.
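
To make the difference concrete, here is a minimal sketch (hypothetical names, not Kafka internals) of the two commit conditions: a quorum commit needs acknowledgements from a majority of all assigned replicas, while a primary-backup commit needs acknowledgements from every replica still in the group.

    // Illustrative commit predicates; names are hypothetical, not Kafka source code.
    public class CommitRules {
        // Quorum-based: committed once a majority of all assigned replicas have acknowledged.
        static boolean quorumCommitted(int ackedReplicas, int totalReplicas) {
            return ackedReplicas > totalReplicas / 2;
        }

        // Primary-backup (Kafka-style): committed once every replica still in the group
        // (for Kafka, every in-sync replica) has acknowledged.
        static boolean primaryBackupCommitted(int ackedReplicas, int currentGroupSize) {
            return ackedReplicas >= currentGroupSize;
        }

        public static void main(String[] args) {
            // With 3 replicas, quorum always needs 2 acks; under primary-backup the group
            // itself can shrink after failures, so a single surviving replica still commits.
            System.out.println(quorumCommitted(2, 3));        // true
            System.out.println(primaryBackupCommitted(1, 1)); // true once the group has shrunk to 1
        }
    }
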
Synchronous replication
Kafka's synchronous replication is a typical primary-backup approach: each partition has n replicas and can tolerate n-1 replica failures. One replica is elected leader and the others are followers. The leader maintains an ISR set: the replicas in this set are fully in sync with the leader. Kafka also stores the current leader and the current ISR of each partition in ZooKeeper.

Each replica stores messages in its local log and maintains two important offsets in that log: the LEO (log end offset) marks the tail of the log, and the HW (high watermark) is the offset of the latest committed message. Each log is periodically flushed to disk, and data before the flushed offset is guaranteed to be durable on disk.

Write
To publish a message to a partition, the client first looks up the partition's leader from ZooKeeper and then sends the message to that leader. The leader appends the message to its local log, and each follower continually pulls new messages from the leader, so the order of messages received by the followers is consistent with the leader's. A follower writes each received message to its own local log and sends an acknowledgement to the leader. Once the leader has received acknowledgements from all ISR replicas, the message is committed: the leader advances the HW and sends an acknowledgement to the client. For better performance, each follower sends its acknowledgement as soon as it has written the message to memory. Therefore, for each committed message we guarantee that it is held on multiple replicas, but there is no guarantee that any replica has persisted the committed message to disk. (A conceptual sketch of this commit rule follows at the end of this subsection.)

Because simultaneous failures of all in-sync replicas are relatively rare, this approach gives a good balance between response time and durability. In the future, Kafka may add an option to provide stronger guarantees.
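
As a conceptual sketch of the commit rule (hypothetical class and field names, not Kafka's internal code): the leader can advance the high watermark to the smallest log end offset among the in-sync replicas, because everything below that point is known to exist on every ISR member.

    import java.util.Collection;
    import java.util.Map;

    // Conceptual model of the leader's commit rule; not Kafka source code.
    public class LeaderState {
        // Log end offset (LEO) reported by each in-sync replica, keyed by broker id.
        private final Map<Integer, Long> isrLogEndOffsets;
        private long highWatermark = 0L;

        public LeaderState(Map<Integer, Long> isrLogEndOffsets) {
            this.isrLogEndOffsets = isrLogEndOffsets;
        }

        // Called after follower fetch responses update their LEOs.
        public void maybeAdvanceHighWatermark() {
            Collection<Long> leos = isrLogEndOffsets.values();
            long minLeo = leos.stream().mapToLong(Long::longValue).min().orElse(highWatermark);
            // Every message below the minimum ISR LEO has been received by all ISR replicas,
            // so it is committed and the HW can move up to that point.
            if (minLeo > highWatermark) {
                highWatermark = minLeo;
            }
        }

        // Consumers are only shown messages below the high watermark.
        public long readableEndOffset() {
            return highWatermark;
        }
    }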

Read
To keep things simple, reads are also served by the leader, and only messages below the HW are exposed to consumers.

Asynchronous replication
To support asynchronous replication, the leader can acknowledge the client as soon as the message has been written to its local log. The one caveat is that during the catch-up phase, a follower may have to truncate the data after the HW position. Because replication to the followers is asynchronous, there is no guarantee that an acknowledged message survives a broker failure.

Replication implementation
An example Kafka replication layout is as follows:

The cluster has 4 brokers (broker1–broker4);
there is 1 topic with 2 partitions and 3 replicas per partition;
the leader of partition topic1-part1 is on broker1, and the leader of partition topic1-part2 is on broker4.
The producer writes a message to the leader of partition topic1-part1 (on broker1), and the leader then replicates it to its two follower replicas on broker2 and broker3.

Similarly, the producer writes a message to the leader of partition topic1-part2 (on broker4), and the leader then replicates it to its two follower replicas on broker2 and broker3.
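
To inspect such a layout on a real cluster, the standard Java AdminClient can describe a topic and print each partition's leader, replica list, and ISR (the broker address and topic name below are placeholders):

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class DescribeReplicas {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            try (AdminClient admin = AdminClient.create(props)) {
                Map<String, TopicDescription> topics =
                        admin.describeTopics(Collections.singletonList("topic1")).all().get(); // placeholder topic
                for (TopicPartitionInfo p : topics.get("topic1").partitions()) {
                    // Shows which broker leads each partition, where the replicas live,
                    // and which replicas are currently in sync (the ISR).
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr());
                }
            }
        }
    }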

When a producer publishes a message to a partition of a topic, the message is first delivered to the leader replica and appended to its log. The follower replicas pull new messages from the leader, and once enough replicas have received the message, the leader commits it.

This raises the question of how the leader decides what counts as "enough". Kafka maintains a set of in-sync replicas (ISR): the replicas in this set are alive and fully caught up with the leader, with no lagging messages (the leader itself is always in the ISR). When a partition is first created, every replica is in the ISR. When a new message is published, the leader waits until all ISR replicas have received it before committing it. If a follower replica fails, it is removed from the ISR; the leader continues committing new messages, except that the ISR is now smaller than the set of replicas the partition was created with.

Note that the system is now running in an under-replicated mode.

The leader also maintains the high watermark (HW), which is the offset of the last committed message in the partition. The HW is continuously propagated to the follower replicas:

[Figure: Kafka high watermark]
When a failed replica restarts, it first recovers the latest HW it had recorded on disk and truncates its log to that HW. This is necessary because messages after the HW are not guaranteed to have been committed and may need to be discarded. The replica then becomes a follower and continues fetching messages after the HW from the leader. Once it has fully caught up with the leader, the replica is re-added to the ISR, and the system returns to fully replicated mode.
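
A conceptual sketch of this recovery step (hypothetical names, not Kafka's actual log implementation) might look like this:

    import java.util.ArrayList;
    import java.util.List;

    // Conceptual model of a replica's log recovery after a restart; not Kafka source code.
    public class ReplicaLog {
        private final List<String> messages = new ArrayList<>(); // index == offset in this toy model
        private long flushedHighWatermark = 0L;

        public void append(String message) {
            messages.add(message);
        }

        public void recordFlushedHighWatermark(long hw) {
            flushedHighWatermark = hw;
        }

        // On restart: truncate everything after the last recorded HW, because messages
        // beyond it may never have been committed and could conflict with the new leader.
        public void recoverAfterRestart() {
            while (messages.size() > flushedHighWatermark) {
                messages.remove(messages.size() - 1);
            }
            // After truncation the replica re-fetches from the leader starting at the HW,
            // and rejoins the ISR once it has fully caught up.
        }

        public long logEndOffset() {
            return messages.size();
        }
    }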

Fault Handling
Kafka relies on ZooKeeper to detect broker failures. A single controller (one of the brokers in the cluster) receives all failure-related notifications from ZooKeeper and is responsible for electing new leaders, which has the benefit of reducing the pressure on ZooKeeper. If a leader fails, the controller elects a new leader from the ISR and publishes the new leader information to the other followers.

By design, messages that have been committed are always preserved during a leader election, while some uncommitted messages may be lost. The leader and the ISR of each partition are also stored in ZooKeeper, which is needed when the controller itself fails over. Because broker-level failures are generally rare, the leader and ISR are not expected to change frequently.

For clients, brokers expose only committed messages to consumers. Because committed data is preserved across broker failures, consumers can continue pulling messages from the newly elected leader using the same offset.

Producers can choose when the broker acknowledges a message. For example, a producer can wait until the message is committed by the leader, that is, acknowledged by all ISR replicas (acks=-1). Alternatively, it can accept acknowledgement as soon as the message has been appended to the leader's log, possibly before it is committed (acks=0 means no acknowledgement from the leader is required; acks=1 means only the leader's acknowledgement is required). In the former case (acks=-1) the producer waits longer, but acknowledged messages are guaranteed to be retained by the brokers. In the latter case (acks=0 or 1) the producer gets lower latency and higher throughput, but some acknowledged messages may be lost if a broker fails. Which option to choose is up to the application.
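
For reference, this is how the acks trade-off can be set with the standard Java producer client (the broker address and topic name are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class AcksExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // acks=all (equivalent to -1): wait until the leader and all ISR replicas have the message.
            // acks=1: wait only for the leader's append; acks=0: do not wait for any acknowledgement.
            props.put(ProducerConfig.ACKS_CONFIG, "all");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("topic1", "key", "value"), (metadata, exception) -> {
                    if (exception == null) {
                        // The callback fires once the configured acknowledgement level has been reached.
                        System.out.printf("stored at partition=%d offset=%d%n",
                                metadata.partition(), metadata.offset());
                    } else {
                        exception.printStackTrace();
                    }
                });
            }
        }
    }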
