In versions prior to 0.8, Kafka provided no high availability mechanism: once one or more brokers went down, all partitions on those brokers stopped serving. If a broker could never recover, or its disk failed, the data on it was lost. One of Kafka's design goals is to provide data persistence, and in a distributed system, especially once the cluster grows to a certain scale, the likelihood of one or more machines going down rises sharply, so the demands on failover are high. Kafka therefore provides a high availability mechanism starting from 0.8. This article introduces Kafka's HA mechanism from two aspects: data replication and leader election.
Why Kafka needs high availability
Why replication is needed
In Kafka versions prior to 0.8, there was no replication. Once a broker went down, none of the partition data on it could be consumed, which conflicts with Kafka's data persistence and delivery guarantee design goals. At the same time, producers could no longer write data to those partitions.
- If the producer uses synchronous mode, it throws an exception after message.send.max.retries resend attempts (default 3), and the user can choose either to stop sending subsequent data or to keep sending. The former causes data to back up, while the latter loses the data that should have been sent to that broker.
- If the producer uses asynchronous mode, it attempts to resend up to message.send.max.retries times (default 3), then records the exception and continues sending subsequent data. This causes data loss that the user can only discover from the logs, since Kafka's producer does not provide a callback interface for the asynchronous mode (see the configuration sketch after this list).
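The retry behavior described above is controlled by producer configuration. Below is a minimal sketch, assuming a hypothetical local 0.8 broker at localhost:9092 and a hypothetical topic named my-topic, of how these settings are passed to the old Scala/Java producer; the class name is made up for illustration.

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class ProducerRetryExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "localhost:9092"); // hypothetical broker address
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("producer.type", "sync");             // "sync" or "async"
            props.put("message.send.max.retries", "3");     // default is 3, as noted above
            props.put("retry.backoff.ms", "100");           // wait between retries

            Producer<String, String> producer = new Producer<>(new ProducerConfig(props));
            try {
                // In sync mode this throws an exception once all retries are exhausted.
                producer.send(new KeyedMessage<>("my-topic", "key", "value"));
            } finally {
                producer.close();
            }
        }
    }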
Thus, without replication, once a machine goes down or a broker stops working, the availability of the entire system is reduced. As the cluster grows, the probability of such failures occurring somewhere in the cluster rises sharply, so introducing a replication mechanism is very important for a production system.
Why leader election is needed
Note: the leader election described here refers mainly to the leader election among replicas.
With replication introduced, the same partition may have multiple replicas, so one of them needs to be elected leader; producers and consumers interact only with this leader, while the other replicas copy data from the leader as followers.
Data consistency must be guaranteed among the multiple replicas of the same partition (after an outage, one of the other replicas must be able to continue serving without causing data duplication or data loss). If there were no leader and all replicas could serve reads and writes simultaneously, every replica would have to synchronize data with every other replica (N×N paths); guaranteeing consistency and ordering would be very hard, the complexity of the replication implementation would rise sharply, and so would the chance of anomalies. With a leader, only the leader handles reads and writes and the followers fetch data from the leader sequentially (N paths), which makes the system simpler and more efficient.
How Kafka's HA design distributes all replicas evenly across the cluster
For better load balancing, Kafka tries to distribute all partitions evenly across the cluster. A typical deployment has more partitions per topic than brokers. At the same time, to improve fault tolerance, the replicas of the same partition must be spread across different machines: if all replicas were on the same broker, the whole partition would stop working when that broker went down, and HA would not be achieved. Furthermore, when a broker goes down, its load should be spread as evenly as possible over all other surviving brokers.
Kafka's algorithm for assigning replicas is as follows (a sketch follows the list):
- Sort all brokers (assume there are n brokers) and the partitions to be assigned
- Assign the i-th partition to the (i mod n)-th broker
- Assign the j-th replica of the i-th partition to the ((i + j) mod n)-th broker
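A minimal sketch of this assignment in Java, following the three steps above literally (Kafka's actual implementation adds further refinements such as a randomized starting broker); the broker IDs, partition count, and replication factor in main are made up for illustration:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class ReplicaAssignmentSketch {
        /**
         * Returns, for each partition i, the list of broker IDs holding its replicas.
         * Partition i goes to broker (i mod n); its j-th replica goes to broker ((i + j) mod n).
         */
        static List<List<Integer>> assignReplicas(List<Integer> sortedBrokerIds,
                                                  int numPartitions,
                                                  int replicationFactor) {
            int n = sortedBrokerIds.size();
            List<List<Integer>> assignment = new ArrayList<>();
            for (int i = 0; i < numPartitions; i++) {
                List<Integer> replicas = new ArrayList<>();
                for (int j = 0; j < replicationFactor; j++) {
                    replicas.add(sortedBrokerIds.get((i + j) % n));
                }
                assignment.add(replicas); // replicas.get(0) is the preferred replica
            }
            return assignment;
        }

        public static void main(String[] args) {
            // 4 brokers, 8 partitions, replication factor 2 (all values are illustrative)
            List<Integer> brokers = Arrays.asList(0, 1, 2, 3);
            List<List<Integer>> assignment = assignReplicas(brokers, 8, 2);
            for (int i = 0; i < assignment.size(); i++) {
                System.out.println("partition " + i + " -> brokers " + assignment.get(i));
            }
        }
    }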
Data Replication
Kafka's data replication needs to address the following issues:
- How to propagate messages
- How many replicas must have received a message before an ACK is sent to the producer
- How to handle the case where a replica stops working
- How to handle a failed replica that recovers and comes back
Propagating messages
When publishing a message to a partition, the producer first finds the leader of that partition through ZooKeeper, and then, regardless of the topic's replication factor (that is, how many replicas the partition has), the producer sends the message only to that partition's leader. The leader writes the message to its local log. Each follower pulls data from the leader, so the order of the data stored by a follower is consistent with the leader's. After a follower receives a message and writes it to its log, it sends an ACK to the leader. Once the leader has received ACKs from all replicas in the ISR, the message is considered committed; the leader advances the HW (high watermark) and sends an ACK to the producer.
To improve performance, each follower sends an ACK to the leader as soon as it receives the data, rather than waiting until the data is written to its log. Therefore, for a committed message, Kafka can only guarantee that it is stored in the memory of multiple replicas, not that it has been persisted to disk, so it cannot fully guarantee that the message will be consumable after an exception occurs. But given how rare this scenario is, this can be seen as a good balance between performance and data persistence. In future releases, Kafka will consider providing higher durability.
Consumers also read messages from the leader, and only committed messages (messages with offsets lower than the HW) are exposed to consumers.
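A minimal in-memory sketch of the commit logic described above: the leader appends to its log, tracks the highest offset each ISR follower has acknowledged, and treats a message as committed (advances the HW) only once every ISR follower has acknowledged it. All class and method names are invented for illustration; this is not Kafka's actual broker code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Toy model of a partition leader advancing the high watermark (HW). */
    public class PartitionLeaderSketch {
        private final List<String> log = new ArrayList<>();
        private final Map<Integer, Long> followerAck = new HashMap<>(); // ISR follower id -> last acked offset
        private long highWatermark = -1;

        PartitionLeaderSketch(List<Integer> isrFollowerIds) {
            for (int id : isrFollowerIds) followerAck.put(id, -1L);
        }

        /** Leader writes the message to its local log; followers will pull it later. */
        long append(String message) {
            log.add(message);
            updateHighWatermark();
            return log.size() - 1;
        }

        /** Called when an ISR follower has written everything up to 'offset' to its log. */
        void onFollowerAck(int followerId, long offset) {
            followerAck.put(followerId, offset);
            updateHighWatermark();
        }

        /** A message is committed once the leader and every ISR follower have it. */
        private void updateHighWatermark() {
            long minFollowerAck = followerAck.values().stream()
                    .min(Long::compare).orElse(Long.MAX_VALUE);
            highWatermark = Math.min(log.size() - 1, minFollowerAck);
        }

        /** Consumers only see committed messages, i.e. offsets up to the HW. */
        List<String> readableByConsumers() {
            return log.subList(0, (int) highWatermark + 1);
        }
    }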
The data flow of Kafka replication is shown in the following figure.
How many replicas must acknowledge before an ACK is sent
Like most distributed systems, Kafka needs a clear definition of whether a broker is "alive" when handling failures. For Kafka, broker liveness has two conditions: first, the broker must maintain its session with ZooKeeper (via ZooKeeper's heartbeat mechanism); second, a follower must be able to copy the leader's messages in time and must not "lag too far behind".
The leader keeps track of the list of replicas it is in sync with, called the ISR (In-Sync Replicas). If a follower goes down or falls too far behind, the leader removes it from the ISR. Here, "lagging too far behind" means the number of messages by which the follower trails the leader exceeds a threshold (configured in $KAFKA_HOME/config/server.properties as replica.lag.max.messages, default 4000), or the follower has not sent a fetch request to the leader for longer than a certain time (configured in the same file as replica.lag.time.max.ms, default 10000). A sketch of this check follows.
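A minimal sketch of the two "lagging too far behind" checks described above; the class, field, and method names are invented for illustration and do not mirror Kafka's internal classes:

    /** Toy check of whether a follower should be dropped from the ISR. */
    public class IsrCheckSketch {
        // Defaults quoted in the text; in a real broker they come from server.properties.
        static final long REPLICA_LAG_MAX_MESSAGES = 4000;   // replica.lag.max.messages
        static final long REPLICA_LAG_TIME_MAX_MS  = 10000;  // replica.lag.time.max.ms

        static boolean shouldRemoveFromIsr(long leaderLogEndOffset,
                                           long followerLogEndOffset,
                                           long lastFetchTimeMs,
                                           long nowMs) {
            boolean laggingByMessages =
                    leaderLogEndOffset - followerLogEndOffset > REPLICA_LAG_MAX_MESSAGES;
            boolean laggingByTime = nowMs - lastFetchTimeMs > REPLICA_LAG_TIME_MAX_MS;
            return laggingByMessages || laggingByTime;
        }
    }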
Kafka's replication mechanism is neither fully synchronous replication nor purely asynchronous replication. Fully synchronous replication requires all working followers to have copied a message before it is considered committed, which severely limits throughput (and high throughput is an important feature of Kafka). With asynchronous replication, followers copy data from the leader asynchronously and a message is considered committed as soon as the leader writes it to its log; in that case, if the followers all lag behind the leader and the leader suddenly goes down, data is lost. Kafka's ISR-based approach strikes a good balance between not losing data and throughput. Followers can replicate data from the leader in batches, which greatly improves replication performance (batched disk writes) and greatly reduces the gap between follower and leader.
It should be noted that Kafka only handles fail/recover and does not deal with "Byzantine" failures. A message is considered committed only after all followers in the ISR have copied it from the leader. This prevents the case where data that has been written only to the leader, and not yet copied by any follower, is lost when the leader goes down (so consumers never consume data that could disappear). For the producer, whether to wait for a message to be committed can be chosen via request.required.acks. This mechanism ensures that as long as the ISR contains one or more replicas, a committed message is not lost. An example follows.
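A minimal sketch of setting request.required.acks on the 0.8 producer; the broker address, topic name, and class name are made up for illustration:

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AcksExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "localhost:9092");   // hypothetical broker
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            //  0 = do not wait for any acknowledgement
            //  1 = wait until the leader has written the message to its log
            // -1 = wait until the message is committed (all replicas in the ISR have it)
            props.put("request.required.acks", "-1");

            Producer<String, String> producer = new Producer<>(new ProducerConfig(props));
            producer.send(new KeyedMessage<>("my-topic", "key", "hello"));
            producer.close();
        }
    }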
Leader election algorithm
The section above explained how Kafka does replication; another important question is how to elect a new leader among the followers when the leader goes down. Because followers may lag far behind or crash, the "newest" follower must be chosen as the new leader. A basic principle is: if the leader is gone, the new leader must have all the messages the original leader committed. This involves a trade-off: if the leader waits for more followers to confirm a message before declaring it committed, then more followers are eligible to become the new leader after it goes down, but throughput decreases.
A very common leader-election method is "majority vote", but Kafka does not adopt it. In this mode, if there are 2f+1 replicas (leader plus followers), then f+1 replicas must have a copy of a message before it is committed, and to guarantee that a new leader can be elected correctly, at most f replicas may fail, because at least one of the remaining f+1 replicas contains all the latest messages. This approach has a big advantage: the system's latency depends only on the faster brokers, not the slowest one. Majority vote also has disadvantages: to keep leader election working, the number of follower failures it can tolerate is relatively small. To tolerate 1 failed follower, there must be at least 3 replicas; to tolerate 2 failed followers, there must be at least 5 replicas. In other words, to guarantee a high degree of fault tolerance in production, there must be many replicas, and a large number of replicas causes a sharp drop in performance under large data volumes. This is why this algorithm is used more in systems that share cluster configuration, such as ZooKeeper, and rarely in systems that need to store large amounts of data. For example, HDFS's HA feature is based on a majority-vote-based journal, but its data storage does not use this approach.
In fact, there are many leader-election algorithms, such as ZooKeeper's Zab, Raft, and Viewstamped Replication. The leader-election algorithm used by Kafka is closer to Microsoft's PacificA algorithm.
Kafka dynamically maintains an ISR (In-Sync Replicas) set in ZooKeeper; every replica in the ISR has caught up with the leader, and only members of the ISR can be elected leader. In this mode, for f+1 replicas, a partition can tolerate the failure of f replicas without losing already committed messages. In most usage scenarios this is very advantageous. In fact, to tolerate the failure of f replicas, majority vote and the ISR approach need to wait for the same number of replicas before a commit, but the total number of replicas required by the ISR approach is almost half of what majority vote needs. A worked comparison follows.
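To make this concrete, here is the arithmetic for tolerating f = 2 failed replicas:

- Majority vote: 2f + 1 = 5 replicas in total, with each commit waiting for f + 1 = 3 acknowledgements.
- ISR approach: only f + 1 = 3 replicas in total, with each commit waiting for all 3 ISR members, i.e. the same 3 acknowledgements.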
Although, compared with the ISR approach, majority vote has the advantage of not having to wait for the slowest broker, the Kafka authors believe this is mitigated by letting the producer choose whether to block on commit, and the replicas and disks saved make the ISR mode worthwhile.
How to handle the case where all replicas stop working
As mentioned above, Kafka guarantees that committed data is not lost as long as at least one follower remains in the ISR; but if all replicas of a partition go down, this guarantee no longer holds. There are two possible options in this case:
- Wait for any replica in the ISR to come back to life and choose it as leader.
- Choose the first replica that comes back to life (not necessarily in the ISR) as leader.
This is a simple trade-off between availability and consistency. If we must wait for a replica in the ISR to come back, the unavailability window may be relatively long; and if all replicas in the ISR can no longer come back or their data is lost, the partition is never available again. If we choose the first replica that comes back to life as the leader, and that replica is not in the ISR, then even though it is not guaranteed to contain all committed messages, it still becomes the leader and serves as the data source for consumers (as described earlier, all reads and writes go through the leader). Kafka 0.8.* uses the second approach. According to Kafka's documentation, a later release will let users choose between these two approaches via configuration, picking high availability or strong consistency depending on the scenario. A sketch of the decision follows.
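A minimal sketch of the decision described above; all names are invented for illustration and this is not the controller's actual code:

    import java.util.List;
    import java.util.Set;

    /** Toy model of the leader choice when the previous leader is gone. */
    public class LeaderChoiceSketch {
        /**
         * Returns the new leader's broker id, or -1 if no replica is alive.
         * If uncleanElection is false, only ISR members may become leader (strong consistency);
         * if true, any live replica may become leader (higher availability, possible data loss).
         */
        static int chooseLeader(List<Integer> assignedReplicas,
                                Set<Integer> isr,
                                Set<Integer> liveBrokers,
                                boolean uncleanElection) {
            for (int replica : assignedReplicas) {
                if (liveBrokers.contains(replica) && isr.contains(replica)) {
                    return replica; // preferred case: a surviving ISR member
                }
            }
            if (uncleanElection) {
                for (int replica : assignedReplicas) {
                    if (liveBrokers.contains(replica)) {
                        return replica; // not in the ISR: committed messages may be lost
                    }
                }
            }
            return -1; // no live replica: the partition stays offline
        }
    }

In later Kafka releases this trade-off is exposed through the unclean.leader.election.enable setting.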
How the leader is elected
The simplest and most intuitive scheme is for all followers to set a watch in ZooKeeper; once the leader goes down, its corresponding ephemeral znode is automatically deleted and all followers try to create that node. The one that creates it successfully (ZooKeeper guarantees that only one succeeds) becomes the new leader, and the other replicas remain followers. A sketch of this naive scheme follows.
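A minimal sketch of this naive scheme using the ZooKeeper Java client; the znode path, payload, and class name are made up for illustration:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class NaiveLeaderElectionSketch {
        /** Each replica calls this when it notices the leader znode is gone.
         *  Returns true if this replica won the election. */
        static boolean tryBecomeLeader(ZooKeeper zk, String leaderPath, int myBrokerId)
                throws KeeperException, InterruptedException {
            try {
                // Ephemeral: the znode disappears automatically if our session dies.
                zk.create(leaderPath,
                          Integer.toString(myBrokerId).getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL);
                return true;                 // we created it first, so we are the leader
            } catch (KeeperException.NodeExistsException e) {
                return false;                // someone else won; remain a follower
            }
        }
    }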
However, there are 3 problems with this approach:
- Split-brain: this is caused by the characteristics of ZooKeeper. Although ZooKeeper guarantees that all watches are triggered in order, it does not guarantee that all replicas "see" the same state at the same moment, which may cause different replicas to respond inconsistently.
- Herd effect: if the broker that goes down hosts many partitions, many watches are triggered, causing a flood of adjustments across the cluster.
- Heavy ZooKeeper load: every replica has to register a watch on ZooKeeper, and when the cluster grows to several thousand partitions, the ZooKeeper load becomes too heavy.
The leader election scheme in Kafka 0.8.* solves the problems above by electing a controller among all brokers; all partition leader elections are decided by the controller. The controller notifies the brokers that need to react to a leader change directly via RPC (which is more efficient than a ZooKeeper queue). The controller is also responsible for topic creation and deletion and for replica reassignment.
HA-related ZooKeeper structures
First, a note on the ZooKeeper structure diagram shown in this section: solid boxes represent path names that are fixed, and dashed boxes represent path names that depend on the business (topic names, partition IDs, and so on).
admin (znodes under this directory exist only while a related operation is in progress and are deleted when the operation finishes)
/admin/preferred_replica_election data structure:
{"Fields": [{"Name": "Version", "type": "int", "Doc": "Version ID"}, {"N Ame ":" Partitions "," type ": {" type ":" Array "," items ": {" Fields ": [ {"Name": "topic", "Type": "string", "Doc": "topic of the par Tition for which preferred replica election should is triggered "}, { "Name": "Partition", "type": "int", "Doc": "The partition for which preferred repli CA election should be triggered '}],} "Doc": "An array of partitions For which preferred replica election should be triggered "}}]}example:{" version ": 1," Partitions ": [{"topic": "Topic1", "Partition": 8}, {"topic": "Topi C2 "," PartitiOn ": 16}]}
/admin/reassign_partitions
Used to reassign some partitions to a different set of brokers. For each partition to be reassigned, Kafka stores all of its replicas and the corresponding broker IDs on this znode. The znode is created by an administrative process and is automatically removed once the reassignment succeeds. Its data structure is as follows:
{"Fields": [{"Name": "Version", "type": "int", "Doc": "Version ID"}, {"Name": "Partitions", "type": {"type": "Array", "Item S ": {" Fields ": [{" Name ":" topic "," Type ":" string "," Doc ":" topic of the partition to be reassigned "}, {" Name ":" Partition " , "type": "int", "Doc": "The partition to be reassigned"}, {"name": "Replicas", "type": "Array", "Items": "int", "Doc": "A lis T of replica IDs "}],}" Doc ":" An array of partitions to being reassigned to new Replicas "}}"}
example:{ "Version": 1, "Partitions": [ { "topic": "Topic3", "Partition": 1, " Replicas ": [1, 2, 3] } ] }
/admin/delete_topics Data structure:
schema:{"Fields": [{"Name": "Version", "type": "int", "Doc": "Version ID"}, {"name": "Topics", "type": {"T Ype ":" Array "," Items ":" string "," Doc ":" An array of topics to be deleted "} }]}example:{ " version ": 1, " Topi CS ": [" Topic4 "," Topic5 "]}
Brokers
The broker registration znode (that is, /brokers/ids/[brokerId]) stores the information of "alive" brokers. The data structure is as follows:
schema:{"Fields": [{"Name": "Version", "type": "int", "Doc": "Version ID"}, {"name": "Host", "Type": "String", " Doc ":" IP address or host name of the broker "}, {" name ":" Port "," type ":" int "," Doc ":" Port of the Broker "}, {" Name ":" "Jmx_port", "type": "int", "Doc": "Port for Jmx"} ]}example:{ "Jmx_port":-1, "host": "Node1", "Version": 1, "port": 9092}
The topic registration znode (/brokers/topics/[topic]) stores, for every partition of the topic, the broker IDs of all its replicas; the first replica is the preferred replica. For a given partition, there is at most one replica on any single broker, so the broker ID can be used as the replica ID. The data structure is as follows:
schema:{"Fields": [{"Name": "Version", "type": "int", "Doc": "Version ID"}, {"name": "Partitions", "type" : {"type": "Map", "values": {"type": "Array", "Items": "int", "Doc": "A list of replica IDs"}, "Doc": "A map from Partition ID to replica list "}, } ]}example:{ " version ": 1, " partitions ": {" A ": [6], " 8 ": [2], "4": [6], "One": [5], "9": [3], "5": [7], "Ten": [4], "6": [8], "1" : [3], "0": [2], "2": [4], "7": [1], "3": [5]}}
The partition state znode (/brokers/topics/[topic]/partitions/[partitionId]/state) has the following structure:
schema:{"Fields": [{"Name": "Version", "type": "int", "Doc": "Version ID"}, {"name": "ISR", "type": {"type ":" Array ", " items ":" int "," Doc ":" An array of the ID of replicas in ISR "} }, {" name ":" Leader "," type ": "int", "Doc": "ID of the leader replica"}, {"name": "Controller_epoch", "type": "int", "Doc": "Epoch of the Controller That last updated the leader and ISR info "}, {" name ":" Leader_epoch "," type ":" int "," Doc ":" Epoch of the leader "}
]}example:{ "Controller_epoch": "Leader": 2, "version": 1, "Leader_epoch": "," ISR ": [2]}
Controller
/controller -> int (broker id of the controller)
Stores the information of the current controller:
schema:{"Fields": [{"Name": "Version", "type": "int", "Doc": "Version ID"}, {"name": "Brokerid", "type": "int" "," Doc ":" Broker ID of the controller "} ]}example:{ " version ": 1," Brokerid ": 8}
/controller_epoch -> int (epoch)
The controller epoch is stored directly as an integer rather than as a JSON string like the other znodes.
Broker failover process
- The controller registers a watch in ZooKeeper. Once a broker goes down (here "down" means any scenario in which the system considers it dead, including but not limited to machine power failure, network unavailability, a stop-the-world pause caused by GC, or a process crash), its corresponding znode in ZooKeeper is automatically deleted, ZooKeeper fires the watch registered by the controller, and the controller reads the latest list of surviving brokers.
- The controller determines set_P, the set containing all partitions on the failed brokers.
- For each partition P in set_P:
  3.1 Read P's current ISR from /brokers/topics/[topic]/partitions/[partition]/state.
  3.2 Decide P's new leader. If at least one replica in the current ISR is still alive, choose one of them as the new leader, and the new ISR contains all surviving replicas of the current ISR. Otherwise, choose any surviving replica of the partition as the new leader and as the ISR (data may be lost in this case). If all replicas of the partition are down, set the new leader to -1.
  3.3 Write the new leader and ISR, together with the new leader_epoch and controller_epoch, to /brokers/topics/[topic]/partitions/[partition]/state. Note that this write is performed only if the znode's version has not changed between 3.1 and 3.3; otherwise, go back to 3.1. A sketch of this conditional update follows the list.
- The controller sends a LeaderAndIsrRequest directly via RPC to the brokers affected by set_P. The controller can send multiple commands in one RPC operation for efficiency.
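A minimal sketch of the conditional write in step 3.3 using the ZooKeeper Java client; the path handling, payload encoding, and helper names are simplified inventions, not the controller's actual code:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ConditionalStateUpdateSketch {
        /** Re-reads the partition state and retries if someone else updated it in between. */
        static void updatePartitionState(ZooKeeper zk, String statePath)
                throws KeeperException, InterruptedException {
            while (true) {
                Stat stat = new Stat();
                byte[] current = zk.getData(statePath, false, stat); // step 3.1: read current ISR + version
                byte[] updated = computeNewState(current);           // step 3.2: pick new leader and ISR
                try {
                    // step 3.3: write back only if the znode version is unchanged
                    zk.setData(statePath, updated, stat.getVersion());
                    return;
                } catch (KeeperException.BadVersionException e) {
                    // Someone modified the state since we read it; go back to 3.1.
                }
            }
        }

        /** Placeholder for the leader/ISR decision in step 3.2 (not implemented here). */
        static byte[] computeNewState(byte[] currentState) {
            return currentState;
        }
    }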
The broker failover sequence diagram is shown below.
About the author
The author (Jason) holds a master's degree and works on big data platform development, specializing in distributed messaging systems such as Kafka and stream processing systems such as Storm.
Sina Weibo: Very_Jason · habren · Personal blog: http://www.jasongj.com
Reposted from: Kafka Design Analysis (Part 2): Kafka High Availability (Part 1)