Kafka versions prior to 0.8 provided no high availability mechanism: once one or more brokers went down, all partitions on them could no longer serve requests, and if a broker could never be recovered, or its disk failed, the data on it was lost. Yet Kafka's design goal is to provide data persistence, and in a distributed system, especially once the cluster grows to a certain size, the probability of one or more machines failing rises sharply, so the demand for a failover mechanism is high. As a result, Kafka provides a high availability mechanism starting from version 0.8. This article introduces Kafka's HA mechanism from two aspects: data replication and leader election.

Why Kafka needs High Availability

Why replication is needed
In Kafka versions before 0.8 there is no replication. Once a broker goes down, none of the partitions on it can be consumed, which contradicts Kafka's design goals of data persistence and delivery guarantee. At the same time, the producer can no longer store data in those partitions. If the producer uses synchronous mode, it throws an exception after attempting to resend message.send.max.retries times (the default is 3), and the user can then choose either to stop sending subsequent data or to continue sending. The former causes data blocking, the latter causes the loss of data that should have been sent to that broker. If the producer uses asynchronous mode, it logs the exception after attempting to resend message.send.max.retries times (the default is 3) and continues sending subsequent data, which results in data loss; the user can only discover the problem through the log.
This shows that without replication, once a machine goes down or a broker stops working, the availability of the entire system drops. As the cluster size increases, the probability of such failures occurring somewhere in the cluster rises sharply, so introducing a replication mechanism is essential for a production system.

Why leader election is needed
(The leader election discussed in this article refers mainly to the leader election among replicas.)
After replication is introduced, the same partition may have multiple replicas, and one of them needs to be chosen as the leader. The producer and consumer interact only with this leader, while the other replicas copy data from the leader as followers.
Data consistency must be guaranteed among the multiple replicas of the same partition (if one of them fails, another replica must be able to continue the service without causing data duplication or data loss). Without a leader, all replicas could read and write data at the same time, and they would have to synchronize data with one another (N x N paths); guaranteeing data consistency and ordering would be very difficult, which would greatly increase the complexity of the replication implementation as well as the likelihood of anomalies. With a leader, only the leader is responsible for reading and writing data, and the followers fetch data sequentially from the leader (N paths), so the system is simpler and more efficient.

Analysis of Kafka HA design

How to distribute all replicas evenly across the cluster
For better load balancing, Kafka tries to distribute all partitions evenly across the cluster. A typical deployment has more partitions per topic than there are brokers. At the same time, to improve Kafka's fault tolerance, the replicas of the same partition should be spread across different machines as far as possible. In fact, if all the replicas were on the same broker, then once that broker went down all replicas of the partition would stop working and no HA effect would be achieved. Also, if a broker goes down, the load it carried should be distributed evenly across all the other surviving brokers. Kafka's replica assignment algorithm is as follows:

1. Sort all brokers (assume there are n brokers) and the partitions to be assigned.
2. Assign the i-th partition to the (i mod n)-th broker.
3. Assign the j-th replica of the i-th partition to the ((i + j) mod n)-th broker.
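To make the rule concrete, here is a small Java sketch of that round-robin assignment (illustrative only; Kafka's real implementation also randomizes the starting position so that broker 0 is not always favored):

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the round-robin assignment rule described above:
// partition i goes to broker (i mod n), and the j-th replica of partition i
// goes to broker ((i + j) mod n).
public class ReplicaAssignmentSketch {

    // Returns, for each partition, the list of broker ids holding its replicas.
    public static List<List<Integer>> assignReplicas(List<Integer> sortedBrokerIds,
                                                     int numPartitions,
                                                     int replicationFactor) {
        int n = sortedBrokerIds.size();
        List<List<Integer>> assignment = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            List<Integer> replicas = new ArrayList<>();
            for (int j = 0; j < replicationFactor; j++) {
                replicas.add(sortedBrokerIds.get((i + j) % n));
            }
            assignment.add(replicas);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 4 brokers, 4 partitions, replication factor 2:
        // partition 0 -> [0, 1], partition 1 -> [1, 2], partition 2 -> [2, 3], partition 3 -> [3, 0]
        System.out.println(assignReplicas(List.of(0, 1, 2, 3), 4, 2));
    }
}

With this rule both partitions and replicas spread evenly across brokers, and no partition keeps all of its replicas on a single broker.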
Data Replication

Kafka's data replication needs to address the following questions:

1. How are messages propagated?
2. How many replicas must have received a message before an ACK is sent to the producer?
3. How is a replica that stops working handled?
4. How is a failed replica that comes back handled?

Propagate messages
When publishing a message to a partition, the producer first finds the leader of that partition through ZooKeeper, and then, regardless of the topic's replication factor (that is, how many replicas the partition has), the producer sends the message only to the leader of that partition. The leader writes the message to its local log. Each follower pulls data from the leader, so the order of the data stored on a follower is consistent with that on the leader. After receiving the message and writing it to its log, the follower sends an ACK to the leader. Once the leader has received ACKs from all the replicas in the ISR, the message is considered committed; the leader then advances the HW (high watermark) and sends an ACK to the producer.
To improve performance, each follower sends an ACK to the leader immediately after receiving the data, rather than waiting until the data has been written to its log. Therefore, for a message that has already been committed, Kafka can only guarantee that it is stored in the memory of multiple replicas; it cannot guarantee that the message has been persisted to disk, and hence cannot fully guarantee that the message will still be consumable after a failure. But given how rarely this scenario occurs, this approach can be considered a good balance between performance and data persistence. In future releases, Kafka will consider providing higher durability.
The consumer also reads messages from the leader, and only messages that have been committed (messages whose offset is lower than the HW) are exposed to the consumer.
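To make the relationship between the ISR and the HW concrete, here is a simplified sketch (not Kafka's actual code; the replica ids and offsets are made up): the HW is the smallest log end offset among the ISR members, and only offsets below it are visible to consumers.

import java.util.Map;

// Simplified illustration: the high watermark (HW) is the smallest log end
// offset (LEO) among the replicas in the ISR. Only messages with offsets
// below the HW are considered committed and visible to consumers.
public class HighWatermarkSketch {

    // isrLogEndOffsets maps replica id -> that replica's log end offset.
    public static long highWatermark(Map<Integer, Long> isrLogEndOffsets) {
        long hw = Long.MAX_VALUE;
        for (long leo : isrLogEndOffsets.values()) {
            hw = Math.min(hw, leo);
        }
        return hw == Long.MAX_VALUE ? 0L : hw;
    }

    public static void main(String[] args) {
        // The leader (id 0) has written up to offset 105, the two ISR followers
        // have fetched up to 103 and 101, so only offsets below 101 are exposed
        // to consumers. Prints 101.
        System.out.println(highWatermark(Map.of(0, 105L, 1, 103L, 2, 101L)));
    }
}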
The Kafka replication data flow is shown in the figure below.

How many backups are needed before sending an ACK
Like most distributed systems, Kafka needs a clear definition of whether a broker is "alive" when handling failures. For Kafka, survival consists of two conditions: first, the broker must maintain its session with ZooKeeper (through ZooKeeper's heartbeat mechanism); second, a follower must be able to copy the leader's messages in a timely manner and not fall "too far behind".
The leader tracks the list of replicas that stay in sync with it, and this list is called the ISR (In-Sync Replicas). If a follower goes down or falls too far behind, the leader removes it from the ISR. "Too far behind" here means that the number of messages by which the follower lags behind the leader exceeds a threshold (configured in $KAFKA_HOME/config/server.properties via replica.lag.max.messages, with a default value of 4000), or that the follower has not sent a fetch request to the leader for longer than a certain time (configured in $KAFKA_HOME/config/server.properties via replica.lag.time.max.ms, with a default value of 10000).
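For reference, both thresholds are broker-side settings; a server.properties fragment that simply restates the defaults quoted above would look like this:

# $KAFKA_HOME/config/server.properties (Kafka 0.8.x)
# Remove a follower from the ISR if it falls more than this many messages behind the leader:
replica.lag.max.messages=4000
# ...or if it has not sent a fetch request to the leader for this many milliseconds:
replica.lag.time.max.ms=10000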
Kafka's replication mechanism is neither fully synchronous replication nor simple asynchronous replication. Fully synchronous replication requires all working followers to have replicated a message before it is considered committed, which greatly limits throughput (and high throughput is a very important feature of Kafka). With asynchronous replication, followers replicate data from the leader asynchronously, and data is considered committed as soon as it is written to the leader's log; in this case, if the followers lag behind the leader and the leader suddenly goes down, data is lost. Kafka's use of the ISR strikes a good balance between not losing data and throughput. Followers can copy data from the leader in bulk, which greatly improves replication performance (bulk writes to disk) and reduces the gap between followers and the leader.
It should be noted that Kafka only handles fail/recover; it does not deal with "Byzantine" failures. A message is considered committed only after all followers in the ISR have copied it from the leader. This avoids the case where some data written to the leader has not yet been replicated by any follower before the leader crashes, causing data loss (data the consumer could never consume). For the producer, it can choose whether to wait for the message to be committed, which is configured through request.required.acks. This mechanism ensures that, as long as the ISR contains one or more followers, a committed message is not lost.
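For example, with the 0.8-era producer API, a producer that waits for the whole ISR to acknowledge each message before send() returns could be configured roughly as follows (a sketch: the broker address and topic name are placeholders):

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

// Minimal sketch of a Kafka 0.8 producer that waits for a commit before send() returns.
public class CommittedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");   // assumed broker address
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("producer.type", "sync");                     // block until the broker responds
        // -1 means the leader only acks after all replicas in the ISR have the message
        props.put("request.required.acks", "-1");

        Producer<String, String> producer = new Producer<>(new ProducerConfig(props));
        producer.send(new KeyedMessage<>("topic1", "key", "a committed message"));
        producer.close();
    }
}

Setting request.required.acks to 1 instead only waits for the leader's own write, and 0 does not wait at all.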
Leader Election Algorithm

The above explains how Kafka does replication. Another important question is how to elect a new leader among the followers when the leader goes down. Since followers may lag far behind or may have crashed, it is important to make sure the "newest" follower is selected as the new leader. A basic principle is that the new leader must hold all the messages committed by the original leader. This requires a tradeoff: the more followers the leader waits for before declaring a message committed, the more followers are eligible to become the new leader after it goes down, but this also reduces throughput.
A very common approach to leader election is "majority vote" ("the minority obeys the majority"), but Kafka does not use it. In this mode, if we have 2f+1 replicas (leader plus followers), then f+1 replicas must have copied a message before the commit, and to elect a new leader correctly the number of failed replicas must not exceed f, because any remaining f+1 replicas are then guaranteed to include at least one replica containing all the latest messages. This approach has a great advantage: the system's latency depends only on the fastest few brokers, not on the slowest one. Majority vote also has disadvantages: to keep leader election working, it can tolerate relatively few failures. Tolerating the failure of 1 follower requires at least 3 replicas, and tolerating 2 failures requires 5 or more replicas. In other words, guaranteeing a high degree of fault tolerance in a production environment requires a large number of replicas, and a large number of replicas leads to a sharp performance drop under large data volumes. This is why this algorithm is used mostly in systems that store shared cluster configuration, such as ZooKeeper, and rarely in systems that need to store large amounts of data. For example, HDFS's HA feature is based on a majority-vote-based journal, but its data storage does not use this approach.
In fact, there are many leader election algorithms, such as ZooKeeper's Zab, Raft, and Viewstamped Replication. The leader election algorithm used by Kafka is closer to Microsoft's PacificA algorithm.
Kafka dynamically maintains an ISR (In-Sync Replicas) set in ZooKeeper; every replica in the ISR has caught up with the leader, and only members of the ISR can be elected leader. In this mode, for f+1 replicas, a partition can tolerate the failure of f replicas without losing committed messages. In most usage scenarios this is very advantageous. In fact, to tolerate the failure of f replicas, majority vote and the ISR approach both have to wait for the same number of replicas before a commit, but the total number of replicas the ISR approach needs is almost half of what majority vote needs.
Compared with the ISR approach, majority vote has the advantage of never waiting for the slowest broker, but the Kafka authors believe this can be mitigated by letting the producer choose whether to block on a commit, and that the replicas and disks saved make the ISR mode worthwhile.

How to handle the case where all replicas stop working
As mentioned above, Kafka can ensure that committed data is not lost as long as there is at least one follower in the ISR; but if all replicas of a partition go down, this can no longer be guaranteed. In that case there are two possible options:

1. Wait for any replica in the ISR to come back to life and choose it as the leader.
2. Choose the first replica that comes back to life (not necessarily a member of the ISR) as the leader.
This is a simple tradeoff between availability and consistency. If we must wait for a replica in the ISR to come back to life, the unavailable time may be relatively long, and if all the replicas in the ISR can no longer come back or their data has been lost, the partition will never be available. If we choose the first replica that comes back to life as the leader, then even though it is not in the ISR and may not contain all committed messages, it still becomes the source of data for consumers (as described earlier, all reads and writes go through the leader). Kafka 0.8.* uses the second approach. According to Kafka's documentation, a future release will let users choose between these two approaches through configuration, selecting high availability or strong consistency according to the usage scenario.
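Later Kafka releases did expose this choice as a configuration. Assuming such a version, the tradeoff described above is controlled by a broker (and per-topic) setting along the following lines:

# Later Kafka versions: shown here only to illustrate the availability-vs-consistency choice above.
# false = only ISR members may become leader (strong consistency, partition may stay unavailable)
# true  = any surviving replica may become leader (high availability, possible data loss);
#         Kafka 0.8.* effectively behaves this way.
unclean.leader.election.enable=false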
How to elect a leader

The simplest and most intuitive scheme is for every follower to set a watch on ZooKeeper: once the leader goes down, its ephemeral znode is automatically deleted, and all followers try to create that node. The one whose creation succeeds (ZooKeeper guarantees that only one can succeed) becomes the new leader, and the other replicas remain followers (a sketch of this naive scheme follows the list below). But this approach has 3 problems:

1. Split-brain. This is caused by the characteristics of ZooKeeper: although ZooKeeper guarantees that all watches are triggered in order, it does not guarantee that all replicas "see" the same state at the same moment, which may cause different replicas to respond inconsistently.
2. Herd effect. If the broker that went down hosted many partitions, many watches fire at once, causing a lot of adjustments across the cluster.
3. ZooKeeper overload. Every replica has to register a watch on ZooKeeper for this purpose, and when the cluster grows to several thousand partitions the ZooKeeper load becomes too high.
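For illustration only, here is a minimal sketch of that naive scheme using the ZooKeeper Java client (the znode path, session timeout, and error handling are assumptions, and this is not how Kafka 0.8 actually elects leaders):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the naive scheme described above: every replica watches the leader
// znode and races to create it when it disappears; ZooKeeper guarantees that
// only one create() succeeds.
public class NaiveLeaderElectionSketch implements Watcher {
    private final ZooKeeper zk;
    private final String leaderPath = "/example/partition-0/leader"; // hypothetical path
    private final int myReplicaId;

    public NaiveLeaderElectionSketch(String zkConnect, int replicaId) throws Exception {
        this.zk = new ZooKeeper(zkConnect, 30000, this);
        this.myReplicaId = replicaId;
        watchLeader();
    }

    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeDeleted
                && leaderPath.equals(event.getPath())) {
            tryToBecomeLeader();
        }
    }

    void tryToBecomeLeader() {
        try {
            // Ephemeral: the znode disappears automatically if this replica's session dies.
            zk.create(leaderPath, String.valueOf(myReplicaId).getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            // Success: this replica is the new leader.
        } catch (KeeperException.NodeExistsException e) {
            // Lost the race: someone else is leader; keep watching.
            watchLeader();
        } catch (KeeperException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    void watchLeader() {
        try {
            zk.exists(leaderPath, this); // sets a one-shot watch on the leader znode
        } catch (KeeperException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}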
The leader election scheme of Kafka 0.8.* solves the problems above by electing a controller among all the brokers; all partition leader elections are decided by the controller. The controller notifies the brokers that need to react to a leader change directly via RPC (which is more efficient than a ZooKeeper queue). The controller is also responsible for adding and deleting topics and for replica redistribution.

HA-related ZooKeeper structure
(In the ZooKeeper structure shown in this section, a solid box represents a fixed path name, and a dashed box represents a path name that varies with the deployment.)
admin (the znodes under this directory exist only while a related operation is in progress and are deleted when the operation finishes)
/admin/preferred_replica_election Data structure
Schema:
{
  "fields": [
    {"name": "version", "type": "int", "doc": "version id"},
    {"name": "partitions",
     "type": {
       "type": "array",
       "items": {
         "fields": [
           {"name": "topic", "type": "string", "doc": "topic of the partition for which preferred replica election should be triggered"},
           {"name": "partition", "type": "int", "doc": "the partition for which preferred replica election should be triggered"}
         ]
       },
       "doc": "an array of partitions for which preferred replica election should be triggered"
     }
    }
  ]
}

Example:
{
  "version": 1,
  "partitions": [
    {"topic": "topic1", "partition": 8},
    {"topic": "topic2", "partition": 16}
  ]
}
The /admin/reassign_partitions znode is used to assign some partitions to a different set of brokers. For each partition to be reassigned, Kafka stores all of its replicas and the corresponding broker ids on this znode. The znode is created by an administrative process and is automatically removed once the reassignment succeeds. Its data structure is as follows:
Schema:
{
  "fields": [
    {"name": "version", "type": "int", "doc": "version id"},
    {"name": "partitions",
     "type": {
       "type": "array",
       "items": {
         "fields": [
           {"name": "topic", "type": "string", "doc": "topic of the partition to be reassigned"},
           {"name": "partition", "type": "int", "doc": "the partition to be reassigned"},
           {"name": "replicas", "type": "array", "items": "int", "doc": "a list of replica ids"}
         ]
       },
       "doc": "an array of partitions to be reassigned to new replicas"
     }
    }
  ]
}

Example:
{
  "version": 1,
  "partitions": [
    {"topic": "topic3", "partition": 1, "replicas": [1, 2, 3]}
  ]
}
/admin/delete_topics Data structure
Schema:
{
  "fields": [
    {"name": "version", "type": "int", "doc": "version id"},
    {"name": "topics",
     "type": {"type": "array", "items": "string", "doc": "an array of topics to be deleted"}
    }
  ]
}

Example:
{
  "version": 1,
  "topics": ["topic4", "topic5"]
}
Brokers
The broker registration information (that is, /brokers/ids/[brokerId]) stores information about "alive" brokers. Its data structure is as follows:
Schema:
{
  "fields": [
    {"name": "version", "type": "int", "doc": "version id"},
    {"name": "host", "type": "string", "doc": "ip address or host name of the broker"},
    {"name": "port", "type": "int", "doc": "port of the broker"},
    {"name": "jmx_port", "type": "int", "doc": "port for jmx"}
  ]
}

Example:
{
  "jmx_port": -1,
  "host": "node1",
  "version": 1,
  "port": 9092
}
Topic registration information (/brokers/topics/[topic]) stores the broker ids of all replicas of every partition of the topic, with the first replica in each list being the preferred replica. For a given partition, there is at most one replica on the same broker, so the broker id can be used as the replica id. The data structure is as follows:
Schema:
{
  "fields": [
    {"name": "version", "type": "int", "doc": "version id"},
    {"name": "partitions",
     "type": {"type": "map",
              "values": {"type": "array", "items": "int", "doc": "a list of replica ids"},
              "doc": "a map from partition id to replica list"}
    }
  ]
}

Example:
{
  "version": 1,
  "partitions": {"12": [6], "8": [2], "4": [6], "11": [5], "9": [3], "5": [7], "10": [4], "6": [8], "1": [3], "0": [2], "2": [4], "7": [1], "3": [5]}
}
The partition state (/brokers/topics/[topic]/partitions/[partitionId]/state) structure is as follows:
Schema:
{
  "fields": [
    {"name": "version", "type": "int", "doc": "version id"},
    {"name": "isr",
     "type": {"type": "array", "items": "int", "doc": "an array of the ids of replicas in isr"}
    },
    {"name": "leader", "type": "int", "doc": "id of the leader replica"},
    {"name": "controller_epoch", "type": "int", "doc": "epoch of the controller that last updated the leader and isr info"},
    {"name": "leader_epoch", "type": "int", "doc": "epoch of the leader"}
  ]
}

Example:
{
  "controller_epoch": …,
  "leader": 2,
  "version": 1,
  "leader_epoch": …,
  "isr": [2]
}
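For illustration, these state znodes can be inspected directly with the ZooKeeper Java client; a minimal sketch, assuming a ZooKeeper ensemble at localhost:2181 and a hypothetical topic named topic1:

import org.apache.zookeeper.ZooKeeper;

// Minimal sketch: read a partition's state znode and print the JSON shown above
// (leader, ISR, epochs). The ZooKeeper address and topic/partition are assumptions.
public class ReadPartitionStateZnode {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        byte[] data = zk.getData("/brokers/topics/topic1/partitions/0/state", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}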
Controller
/controller -> int (broker id of the controller) stores information about the current controller.
Schema:
{
  "fields": [
    {"name": "version", "type": "int", "doc": "version id"},
    {"name": "brokerid", "type": "int", "doc": "broker id of the controller"}
  ]
}

Example:
{"version": 1, "brokerid": 8}
/controller_epoch -> int (epoch) stores the controller epoch directly as an integer rather than as a JSON string like the other znodes.

Broker Failover Process Overview

1. The controller registers a watch in ZooKeeper. Once a broker goes down (here "down" covers any scenario in which the system considers it dead, including but not limited to machine power failure, network unavailability, a GC-induced stop-the-world pause, or a process crash), its corresponding znode in ZooKeeper is automatically deleted, ZooKeeper fires the watch registered by the controller, and the controller reads the latest list of surviving brokers.
2. The controller determines set_P, the set of all partitions on all the brokers that went down.
3. For each partition in set_P:
3.1 Read the partition's current ISR from /brokers/topics/[topic]/partitions/[partition]/state.
3.2 Decide the new leader of the partition. If at least one replica in the current ISR is still alive, choose one of them as the new leader, and the new ISR contains all surviving replicas of the current ISR. Otherwise, choose any surviving replica of the partition (not necessarily in the ISR) as the new leader and ISR (this scenario may involve potential data loss). If all replicas of the partition are down, set the new leader to -1.
3.3 Write the new leader, ISR, leader_epoch and controller_epoch to /brokers/topics/[topic]/partitions/[partition]/state. Note that this write only succeeds if the znode's version has not changed between steps 3.1 and 3.3; otherwise the controller jumps back to 3.1 (see the sketch after these steps).
4. Send a LeaderAndIsrRequest command directly via RPC to the brokers involved in set_P. The controller can improve efficiency by sending multiple commands in a single RPC operation.
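The requirement in step 3.3 that the write only succeed if the znode's version is unchanged maps naturally onto ZooKeeper's conditional setData. A simplified sketch (not the actual Kafka controller code):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch of the conditional update in step 3.3 using ZooKeeper's version check.
// If another writer changed the state znode since it was read, setData fails with
// BadVersionException and the caller retries from step 3.1.
public class ConditionalStateUpdateSketch {
    static boolean tryUpdateState(ZooKeeper zk, String statePath, byte[] newState)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        zk.getData(statePath, false, stat);   // step 3.1: read current state and remember its version
        // steps 3.2 / 3.3 would compute newState (new leader, ISR, leader_epoch, controller_epoch)
        try {
            zk.setData(statePath, newState, stat.getVersion()); // succeeds only if version unchanged
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false; // someone else updated the znode in between: retry from step 3.1
        }
    }
}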
The broker failover sequence diagram is shown below.