Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and became an Apache project in July 2011. Today, Kafka is used by LinkedIn, Twitter, and Square for applications including log aggregation, queuing, and real-time monitoring and event processing.
In the upcoming version 0.8 release, Kafka will support intra-cluster replication, which increases both the availability and the durability of the system. In this post, I'll give an overview of Kafka's replication design.
Kafka Introduction
Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale website. This kind of activity (page views, searches, and other user actions) is a key ingredient in many of the social features on the modern web. Kafka differs from traditional messaging systems in that:
- It's designed as a distributed system that's easy to scale out.
- It persists messages on disk and thus can be used for batched consumption such as ETL, in addition to real-time applications.
- It offers high throughput for both publishing and subscribing.
- It supports multi-subscribers and automatically balances the consumers during failure.
Check out the Kafka Design Wiki for more details.
Replication
With replication, Kafka clients would get the following benefits:
- A producer can continue to publish messages during failure, and it can choose between latency and durability, depending on the application.
- A consumer continues to receive the correct messages in real time, even when there is a failure.
All distributed systems must make trade-offs between guaranteeing consistency, availability, and partition tolerance (CAP theorem). Our goal was to support replication in a Kafka cluster within a single datacenter, where network partitioning is rare, so our design focuses on maintaining highly available and strongly consistent replicas. Strong consistency means that all replicas are byte-to-byte identical, which simplifies the job of an application developer.
Strongly consistent replicas
In the literature, there are two typical approaches to maintaining strongly consistent replicas. Both require one of the replicas to be designated as the leader, to which all writes are issued. The leader is responsible for ordering all incoming writes and for propagating those writes to the other replicas (followers) in the same order.
The first approach is quorum-based. The leader waits until a majority of replicas have received the data before it is considered safe (i.e., committed). On leader failure, a new leader is elected through the coordination of a majority of the followers. This approach is used in Apache Zookeeper and Google's Spanner.
The second approach is for the leader to wait for "all" (to be clarified later) replicas to receive the data. When the leader fails, any other replica can then take over as the new leader.
We selected the second approach for Kafka replication for two primary reasons:
- The second approach can tolerate more failures with the same number of replicas. That is, it can tolerate f failures with f+1 replicas, while the first approach often only tolerates f failures with 2f+1 replicas. For example, if there are only 2 replicas, the first approach can't tolerate any failures.
- While the first approach generally has better latency, as it hides the delay from a slow replica, our replication is designed for a cluster within the same datacenter, so variance due to network delay is small.
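To make the failure-tolerance arithmetic concrete, here is a small sketch in Python. The function names are made up for illustration and are not part of Kafka; the code only restates the f+1 versus 2f+1 relationship described above.

```python
def tolerated_failures_quorum(replicas: int) -> int:
    """Majority quorum: 2f + 1 replicas tolerate f failures."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return (replicas - 1) // 2


def tolerated_failures_wait_for_all(replicas: int) -> int:
    """Wait-for-all approach: f + 1 replicas tolerate f failures."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas - 1


# Compare the two approaches for a few replica counts.
for n in (2, 3, 5):
    print(n, tolerated_failures_quorum(n), tolerated_failures_wait_for_all(n))
```

With 2 replicas, the quorum approach tolerates 0 failures while the wait-for-all approach tolerates 1, which matches the example in the list above.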
Terminology
To understand how replication is implemented in Kafka, we need to first introduce some basic concepts. In Kafka, a message stream is defined by a topic, divided into one or more partitions. Replication happens at the partition level, and each partition has one or more replicas.
The replicas are assigned evenly to different servers (called brokers) in a Kafka cluster. Each replica maintains a log on disk. Published messages are appended sequentially to the log, and each message is identified by a monotonically increasing offset within the log.
The offset is a logical concept within a partition. Given an offset, the same message can be identified in each replica of the partition. When a consumer subscribes to a topic, it keeps track of an offset in each partition for consumption and uses it to issue fetch requests to the broker.
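The idea of a logical offset can be sketched with a toy in-memory log. This is only an illustration: `PartitionLog` and its methods are hypothetical names, and a real broker stores the log on disk rather than in a Python list.

```python
class PartitionLog:
    """Toy append-only log; the offset is simply the position in the list."""

    def __init__(self):
        self._messages = []

    def append(self, message: bytes) -> int:
        """Append a message and return the offset assigned to it."""
        self._messages.append(message)
        return len(self._messages) - 1

    def fetch(self, offset: int, max_messages: int = 10) -> list:
        """Return up to max_messages messages starting at the given offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return self._messages[offset:offset + max_messages]


# Two replicas of the same partition receive the same messages in the same order.
leader_log = PartitionLog()
follower_log = PartitionLog()
for message in (b"page-view", b"search", b"click"):
    leader_log.append(message)
    follower_log.append(message)

# A consumer tracking offset 1 gets the same message from either replica.
consumer_offset = 1
print(leader_log.fetch(consumer_offset, 1) == follower_log.fetch(consumer_offset, 1))
```

Because offsets are positions in an identically ordered log, a consumer only needs to remember one number per partition to resume reading from any replica.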
Implementation
Figure 1. A Kafka cluster with 4 brokers, 1 topic and 2 partitions, each with 3 replicas
When a producer publishes a message to a partition in a topic, the message is first forwarded to the leader replica of the partition and is appended to its log. The follower replicas keep pulling new messages from the leader. Once enough replicas have received the message, the leader commits it.
One subtle issue is how the leader decides what's enough. The leader can't always wait for writes to complete on all replicas. This is because any follower replica can fail and the leader can't wait indefinitely.
To address this problem, for each partition of a topic, we maintain an in-sync replica set (ISR). This is the set of replicas that are alive and have fully caught up with the leader (note that the leader is always in the ISR). When a partition is created initially, every replica is in the ISR. When a new message is published, the leader waits until it reaches all replicas in the ISR before committing the message. If a follower replica fails, it will be dropped out of the ISR and the leader then continues to commit new messages with fewer replicas in the ISR. Notice that now, the system is running in an under-replicated mode.
The leader also maintains a high watermark (HW), which is the offset of the last committed message in a partition. The HW is continuously propagated to the followers and is checkpointed to disk in each broker periodically for recovery.
When a failed replica is restarted, it first recovers the latest HW from disk and truncates its log to the HW. This is necessary since messages after the HW are not guaranteed to be committed and may need to be thrown away. Then, the replica becomes a follower and starts fetching messages after the HW from the leader. Once it has fully caught up, the replica is added back to the ISR and the system is back in fully replicated mode.
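The commit rule can be sketched as a toy model in Python. This is not Kafka's code: `LeaderPartition` and its methods are hypothetical, the HW is taken to be the offset of the last committed message (with -1 meaning nothing is committed yet), and followers are assumed to report the last offset they hold.

```python
class LeaderPartition:
    """Toy model of a partition leader tracking the ISR and the high watermark."""

    def __init__(self, leader_id, replica_ids):
        self.leader_id = leader_id
        self.isr = set(replica_ids) | {leader_id}  # every replica starts in the ISR
        self.log = []
        # Offset of the last message each replica is known to hold (-1 = none).
        self.last_offset = {replica: -1 for replica in self.isr}
        self.high_watermark = -1  # offset of the last committed message

    def append(self, message):
        """Append a published message to the leader's log and return its offset."""
        self.log.append(message)
        self.last_offset[self.leader_id] = len(self.log) - 1
        self._advance_high_watermark()
        return len(self.log) - 1

    def follower_fetched(self, replica_id, offset):
        """Record that a follower now holds every message up to offset."""
        self.last_offset[replica_id] = max(self.last_offset[replica_id], offset)
        self._advance_high_watermark()

    def drop_from_isr(self, replica_id):
        """Remove a failed follower so that commits can proceed without it."""
        if replica_id == self.leader_id:
            raise ValueError("the leader is always in the ISR")
        self.isr.discard(replica_id)
        self._advance_high_watermark()

    def _advance_high_watermark(self):
        # A message is committed once every replica in the ISR holds it.
        committed = min(self.last_offset[replica] for replica in self.isr)
        self.high_watermark = max(self.high_watermark, committed)


leader = LeaderPartition(leader_id=1, replica_ids=[1, 2, 3])
leader.append("m0")
leader.follower_fetched(2, 0)
leader.follower_fetched(3, 0)   # all of the ISR holds m0, so m0 is committed
leader.append("m1")
leader.follower_fetched(2, 1)   # replica 3 is still behind, so m1 is not committed
leader.drop_from_isr(3)         # replica 3 fails; m1 commits with a smaller ISR
print(leader.high_watermark)
```

The last step shows the under-replicated mode: once the slow or failed follower leaves the ISR, the leader keeps committing with the remaining replicas.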
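The recovery steps can be illustrated with plain Python lists standing in for logs. The function name and arguments are hypothetical; the sketch assumes the HW is the offset of the last committed message and that the leader's log is directly readable, which a real follower would replace with fetch requests.

```python
def recover_follower(local_log, checkpointed_hw, leader_log):
    """Truncate a restarted replica's log to the HW, then catch up from the leader."""
    # Step 1: keep only messages up to and including the checkpointed HW.
    # Anything after it may never have been committed.
    truncated = local_log[:checkpointed_hw + 1]
    # Step 2: fetch every message after the HW from the current leader.
    return truncated + leader_log[checkpointed_hw + 1:]


# The restarted replica holds an uncommitted message "x2" after its HW of 1.
local_log = ["m0", "m1", "x2"]
leader_log = ["m0", "m1", "m2", "m3"]
print(recover_follower(local_log, 1, leader_log))
```

In this example the uncommitted "x2" is discarded and replaced by the leader's messages, so the recovered log is identical to the leader's.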
Handling Failures
We rely on Zookeeper for detecting broker failures. Similar to Helix, we use a controller (embedded in one of the brokers) to receive all Zookeeper notifications about the failure and to elect new leaders. If a leader fails, the controller selects one of the replicas in the ISR as the new leader and informs the followers about the new leader.
By design, committed messages are always preserved during leadership change, whereas some uncommitted data could be lost. The leader and the ISR for each partition are also stored in Zookeeper and are used during the failover of the controller. Both the leader and the ISR are expected to change infrequently since failures are rare.
For clients, a broker only exposes committed messages to the consumers. Since committed data is always preserved during broker failures, a consumer can automatically fetch messages from another replica, using the same offset.
A producer can choose when to receive the acknowledgement from the broker after publishing a message. For example, it can wait until the message is committed by the leader (i.e., it's received by all replicas in the ISR). Alternatively, it may choose to receive an acknowledgement as soon as the message is appended to the log in the leader replica, even though it may not be committed yet. In the former case, the producer has to wait a bit longer, but all acknowledged messages are guaranteed to be kept by the brokers. In the latter case, the producer has lower latency, but a small number of acknowledged messages could be lost when a broker fails.
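A rough sketch of the controller's decision follows. The state layout and function name are assumptions for illustration, and picking the first remaining ISR member is an arbitrary choice here; the text above only says that the controller selects one of the replicas in the ISR.

```python
def handle_broker_failure(partition_state, failed_broker):
    """Return the new leader and ISR for one partition after a broker fails.

    partition_state is a dict such as {"leader": 1, "isr": [1, 2, 3]},
    mirroring the leader and ISR that are stored per partition.
    """
    remaining_isr = [b for b in partition_state["isr"] if b != failed_broker]
    leader = partition_state["leader"]
    if leader == failed_broker:
        if not remaining_isr:
            raise RuntimeError("no in-sync replica available to become leader")
        # Any ISR member holds all committed messages, so any of them is safe.
        leader = remaining_isr[0]
    return {"leader": leader, "isr": remaining_isr}


print(handle_broker_failure({"leader": 1, "isr": [1, 2, 3]}, failed_broker=1))
```

Choosing only from the ISR is what preserves committed messages across a leadership change, since every ISR member has received everything up to the high watermark.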
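The two acknowledgement choices can be expressed as a small decision function. `AckMode` and `can_acknowledge` are hypothetical names, not Kafka's producer API; the sketch assumes the leader knows the offset of its last appended message and the high watermark (the offset of the last committed message).

```python
from enum import Enum


class AckMode(Enum):
    LEADER_APPENDED = "leader_appended"  # lower latency, weaker durability
    COMMITTED = "committed"              # higher latency, stronger durability


def can_acknowledge(mode, offset, last_appended, high_watermark):
    """Decide whether the leader may acknowledge the message at offset."""
    if mode is AckMode.LEADER_APPENDED:
        # Acknowledge as soon as the message is in the leader's log.
        return offset <= last_appended
    # Otherwise wait until all replicas in the ISR have received it.
    return offset <= high_watermark


# Message 5 is appended on the leader, but only offsets up to 3 are committed.
print(can_acknowledge(AckMode.LEADER_APPENDED, 5, last_appended=5, high_watermark=3))
print(can_acknowledge(AckMode.COMMITTED, 5, last_appended=5, high_watermark=3))
```

In the first mode, messages 4 and 5 could be acknowledged and then lost if the leader failed before the followers fetched them, which is the durability trade-off described above.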
More info
For more details about the design and the implementation, check out the Kafka Replication Wiki and drop by my Kafka replication talk at ApacheCon in late February.
Reference: http://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Replication