Kafka Design Analysis (III): Kafka High Availability (Part 2)


Summary

Building on the previous article, this post explains Kafka's HA mechanism in detail, walking through the HA-related scenarios of broker failover, controller failover, topic creation/deletion, broker startup, and the follower fetching data from the leader. It also introduces the replication-related tools provided by Kafka, such as the partition reassignment tool.

Broker Failover Process

How the controller handles broker failure:
    1. The controller registers a watch on the /brokers/ids node in ZooKeeper. Once a broker goes down (this article uses "down" for any scenario in which Kafka considers the broker dead, including but not limited to machine power loss, network unavailability, stop-the-world GC pauses, and process crashes), the corresponding znode in ZooKeeper is deleted automatically, the watch registered by the controller fires, and the controller obtains the latest list of surviving brokers.
    2. The controller determines set_p, the set of all partitions on the failed broker(s).
    3. For each partition in set_p:
      3.1 Read the partition's current ISR from /brokers/topics/[topic]/partitions/[partition]/state.
      3.2 Determine the new leader for the partition (see the sketch after this list). If at least one replica in the current ISR is still alive, select one of them as the new leader, and the new ISR consists of all surviving replicas of the current ISR. Otherwise, select any surviving replica of the partition as the new leader and ISR (data loss is possible in this case). If all replicas of the partition are down, set the new leader to -1.
      3.3 Write the new leader, ISR, leader_epoch, and controller_epoch to /brokers/topics/[topic]/partitions/[partition]/state. This write is performed only if the controller's version has not changed between steps 3.1 and 3.3; otherwise, jump back to 3.1.
    4. Send a LeaderAndIsrRequest directly via RPC to the brokers affected by set_p. The controller can improve efficiency by sending multiple commands in a single RPC operation.
      The broker failover sequence diagram is shown below.
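
To make step 3.2 concrete, here is a minimal sketch (in Scala, not Kafka's actual controller code) of how a new leader and ISR could be chosen from the surviving replicas; assignedReplicas, currentIsr, and liveBrokers are hypothetical inputs, and the real logic lives in the controller's partition leader selectors.

    object LeaderSelectionSketch {
      // Step 3.2: pick a new leader and ISR for one partition of a failed broker.
      def selectNewLeader(assignedReplicas: Seq[Int],
                          currentIsr: Seq[Int],
                          liveBrokers: Set[Int]): (Int, Seq[Int]) = {
        val liveIsr = currentIsr.filter(liveBrokers.contains)
        if (liveIsr.nonEmpty)
          (liveIsr.head, liveIsr)          // a surviving ISR replica becomes leader, surviving ISR is kept
        else {
          val liveAr = assignedReplicas.filter(liveBrokers.contains)
          liveAr.headOption match {
            case Some(replica) => (replica, Seq(replica))  // fall back to any live replica (possible data loss)
            case None          => (-1, Seq.empty)          // the whole partition is down
          }
        }
      }

      def main(args: Array[String]): Unit =
        println(selectNewLeader(Seq(1, 2, 3), Seq(1, 2), liveBrokers = Set(2, 3)))  // prints (2,List(2))
    }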

The LeaderAndIsrRequest structure is as follows:

The LeaderAndIsrResponse structure is as follows:
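
As a rough textual stand-in for the structure diagrams, the case classes below paraphrase the main fields of the 0.8.x-era LeaderAndIsrRequest and LeaderAndIsrResponse; the names and nesting are approximate rather than the exact wire format (see the Kafka protocol documentation for that).

    // Approximate shape of LeaderAndIsrRequest / LeaderAndIsrResponse; field names paraphrased.
    case class LeaderIsrAndControllerEpoch(leader: Int, leaderEpoch: Int, isr: List[Int],
                                           zkVersion: Int, controllerEpoch: Int)

    case class PartitionStateInfo(leaderIsrAndControllerEpoch: LeaderIsrAndControllerEpoch,
                                  allReplicas: Set[Int])

    case class LeaderAndIsrRequestSketch(controllerId: Int,
                                         controllerEpoch: Int,
                                         correlationId: Int,
                                         partitionStateInfos: Map[(String, Int), PartitionStateInfo],
                                         liveLeaders: Set[(Int, String, Int)])   // (brokerId, host, port)

    case class LeaderAndIsrResponseSketch(correlationId: Int,
                                          responseMap: Map[(String, Int), Short], // per-partition error code
                                          errorCode: Short)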

Create/Delete Topic
    1. The controller registers a watch on the /brokers/topics node in ZooKeeper. Once a topic is created or deleted, the controller obtains the partition/replica assignment of the newly created/deleted topic through this watch.
    2. For a delete-topic operation, the topic tool stores the topic name under /admin/delete_topics. If delete.topic.enable is true, the watch the controller registered on /admin/delete_topics fires and the controller notifies the corresponding brokers via a callback; if it is false, the controller does not act on the event, so the delete-topic operation is only recorded and not executed.
    3. For a create-topic operation, the controller reads the list of all currently available brokers from /brokers/ids, and for each partition in set_p:
      3.1 Selects one available broker from all the replicas assigned to the partition (collectively called AR) as the new leader, and sets AR as the new ISR (because the topic is newly created, all replicas in AR hold no data and can be considered in sync, i.e. all in the ISR, so any of them can serve as leader). A sketch of this step follows the list below.
      3.2 Writes the new leader and ISR to /brokers/topics/[topic]/partitions/[partition].
    4. Sends LeaderAndIsrRequest directly to the relevant brokers via RPC.
      The create-topic sequence diagram is shown below.
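
As a small illustration of step 3.1 (again a sketch, not Kafka's code): for a newly created partition, the first live replica of AR becomes the leader and the live replicas of AR become the ISR.

    object NewTopicLeaderSketch {
      // Step 3.1: for a new partition, the first live replica in AR is the leader and live AR is the ISR.
      def initialLeaderAndIsr(assignedReplicas: Seq[Int],
                              liveBrokers: Set[Int]): Option[(Int, Seq[Int])] = {
        val liveAr = assignedReplicas.filter(liveBrokers.contains)
        liveAr.headOption.map(leader => (leader, liveAr))  // new replicas hold no data, so all count as in sync
      }
    }
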
Broker Request Handling Process

The broker accepts and responds to various requests through kafka.network.SocketServer and related modules. The whole network communication module is built on Java NIO and uses the Reactor pattern, with 1 Acceptor responsible for accepting client connections, N Processors responsible for reading and writing data, and M Handlers for processing the business logic.
The Acceptor's primary responsibility is to listen for and accept connection requests from clients (the request initiators, including but not limited to producers, consumers, the controller, and admin tools), and to establish a data transfer channel with the client. It then hands the client over to a Processor; at that point its work for that client's connection request is done, and it can move on to the next client's connection request. The core code is as follows.
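
As a stand-in for the Acceptor code referenced above, here is a minimal, self-contained sketch of the reactor-style accept loop it implements, written against plain Java NIO rather than kafka.network.Acceptor itself: the acceptor only accepts connections and hands each new channel to one of the N processors in round-robin fashion.

    import java.net.InetSocketAddress
    import java.nio.channels.{SelectionKey, Selector, ServerSocketChannel, SocketChannel}
    import java.util.concurrent.ConcurrentLinkedQueue

    // Minimal reactor-style acceptor sketch (not Kafka's actual Acceptor).
    class ProcessorSketch {
      // Channels handed over by the acceptor; the processor's own run loop would
      // register them for OP_READ on its private selector and do the reads/writes.
      val newConnections = new ConcurrentLinkedQueue[SocketChannel]()
      def accept(channel: SocketChannel): Unit = newConnections.add(channel)
    }

    class AcceptorSketch(port: Int, processors: Array[ProcessorSketch]) extends Runnable {
      private val selector = Selector.open()
      private val serverChannel = ServerSocketChannel.open()

      override def run(): Unit = {
        serverChannel.configureBlocking(false)
        serverChannel.socket().bind(new InetSocketAddress(port))
        serverChannel.register(selector, SelectionKey.OP_ACCEPT)
        var currentProcessor = 0
        while (!Thread.currentThread().isInterrupted) {
          if (selector.select(500) > 0) {
            val keys = selector.selectedKeys().iterator()
            while (keys.hasNext) {
              val key = keys.next(); keys.remove()
              if (key.isAcceptable) {
                val channel = key.channel().asInstanceOf[ServerSocketChannel].accept()
                channel.configureBlocking(false)
                // Hand the new connection to a processor (round robin), then go back to accepting.
                processors(currentProcessor).accept(channel)
                currentProcessor = (currentProcessor + 1) % processors.length
              }
            }
          }
        }
      }
    }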

  
The Processor is primarily responsible for reading data from clients and returning responses to them; it does not handle the specific business logic itself, and it maintains an internal queue holding all the SocketChannels assigned to it. The Processor's run method repeatedly takes new SocketChannels from this queue, registers them with the selector for SelectionKey.OP_READ, and then loops over the ready read (request) and write (response) events. After the Processor has read the data, it encapsulates it as a Request object and hands it to the RequestChannel.
The RequestChannel is where the Processors and KafkaRequestHandlers exchange data. It contains a requestQueue that stores the requests produced by the Processors, from which the KafkaRequestHandlers take requests, and it also contains a responseQueue that stores the responses to be returned to clients after the KafkaRequestHandlers have processed the requests.
Through its processNewResponses method, the Processor takes the responses saved in the RequestChannel's responseQueue and registers the corresponding SelectionKey.OP_WRITE event on the selector. When the selector's select method returns and a writable channel is detected, the write method is called to return the response to the client.
The KafkaRequestHandler loop takes requests from the RequestChannel and hands them to kafka.server.KafkaApis to handle the specific business logic.
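
The hand-off described above boils down to a pair of queues between the Processors and the KafkaRequestHandlers. The sketch below illustrates the idea with plain Java blocking queues and hypothetical Request/Response types; it is not Kafka's kafka.network.RequestChannel, and it keeps one response queue per processor so that each processor only picks up the responses for its own connections.

    import java.util.concurrent.ArrayBlockingQueue

    // Hypothetical request/response payloads, for illustration only.
    case class RequestSketch(processorId: Int, payload: Array[Byte])
    case class ResponseSketch(processorId: Int, payload: Array[Byte])

    // Sketch of the RequestChannel idea: processors put requests in, handlers take them out;
    // handlers put responses in, and the owning processor takes them out and writes to the socket.
    class RequestChannelSketch(queueSize: Int, numProcessors: Int) {
      private val requestQueue = new ArrayBlockingQueue[RequestSketch](queueSize)
      private val responseQueues =
        Array.fill(numProcessors)(new ArrayBlockingQueue[ResponseSketch](queueSize))

      def sendRequest(req: RequestSketch): Unit = requestQueue.put(req)        // called by a Processor
      def receiveRequest(): RequestSketch = requestQueue.take()                // called by a KafkaRequestHandler
      def sendResponse(resp: ResponseSketch): Unit =
        responseQueues(resp.processorId).put(resp)                             // called by a handler
      def receiveResponse(processorId: Int): ResponseSketch =
        responseQueues(processorId).take()                                     // called by the owning Processor
    }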

LeaderAndIsrRequest Response Process

A broker handles a received LeaderAndIsrRequest mainly through ReplicaManager's becomeLeaderOrFollower method, which works as follows:

    1. If the controllerEpoch in the request is less than the current controllerEpoch, ErrorMapping.StaleControllerEpochCode is returned directly.
    2. For each element in the request's partitionStateInfos, i.e. each ((topic, partitionId), partitionStateInfo):
      2.1 If the leader epoch in the partitionStateInfo is greater than the leader epoch stored in the current ReplicaManager for that partition (topic, partitionId), then:
      2.1.1 If the current broker ID (i.e. replica ID) is in the partitionStateInfo, the partition and partitionStateInfo are stored in a HashMap named partitionState.
      2.1.2 Otherwise, the broker is not in the partition's assigned replica list, and the information is recorded in the log.
      2.2 Otherwise, the corresponding error code (ErrorMapping.StaleLeaderEpochCode) is stored in the response.
    3. Filter out of partitionState all records whose leader equals the current broker ID into partitionsToBeLeader, and the remaining records into partitionsToBeFollower (this split is sketched below).
    4. If partitionsToBeLeader is not empty, execute the makeLeaders method on it.
    5. If partitionsToBeFollower is not empty, execute the makeFollowers method on it.
    6. If the high watermark checkpoint thread has not been started, start it and set hwThreadInitialized to true.
    7. Shut down all fetchers in the idle state.

The LeaderAndIsrRequest processing flow is shown in the diagram below.
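
As a small illustration of steps 3-5 (a sketch with simplified types, not ReplicaManager's actual code): once stale epochs have been filtered out, the remaining partition states are split according to whether the requested leader is this broker, and each half is handed to makeLeaders or makeFollowers respectively.

    object BecomeLeaderOrFollowerSketch {
      // Simplified partition state: (topic, partitionId) -> requested leader broker id.
      def splitByRole(partitionState: Map[(String, Int), Int],
                      localBrokerId: Int): (Map[(String, Int), Int], Map[(String, Int), Int]) =
        partitionState.partition { case (_, leader) => leader == localBrokerId }
        // first element -> makeLeaders(...), second element -> makeFollowers(...)
    }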

Broker Startup Process

After a broker starts, it first creates an ephemeral child node under the ZooKeeper node /brokers/ids based on its ID (this registration is sketched after the list below). Once the node is created successfully, the broker-change watch registered by the controller's ReplicaStateMachine fires, and the following steps are completed via the callback KafkaController.onBrokerStartup:

    1. Send an UpdateMetadataRequest to all newly started brokers; its structure is defined as follows.
    2. Set all replicas on the newly started broker to the OnlineReplica state, and the broker starts the high watermark thread for these partitions.
    3. Trigger OnlinePartitionStateChange via the PartitionStateMachine.
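
The registration step described above amounts to creating an ephemeral znode under /brokers/ids. A minimal sketch using the plain ZooKeeper client (not Kafka's own ZK utilities, and with a simplified JSON payload rather than Kafka's exact broker registration format) might look like this:

    import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

    // Sketch: register a broker by creating an ephemeral znode under /brokers/ids.
    // The node disappears automatically when the broker's ZooKeeper session dies,
    // which is exactly what fires the controller's broker-change watch.
    object BrokerRegistrationSketch {
      def register(zkConnect: String, brokerId: Int, host: String, port: Int): ZooKeeper = {
        val zk = new ZooKeeper(zkConnect, 6000, new Watcher {
          override def process(event: WatchedEvent): Unit = ()  // ignore session events in this sketch
        })
        val data = s"""{"host":"$host","port":$port}"""          // simplified payload
        zk.create(s"/brokers/ids/$brokerId", data.getBytes("UTF-8"),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)
        zk
      }
    }
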
Controller Failover

The controller also needs failover. Each broker registers a watch on the controller path (/controller). When the current controller fails, the corresponding controller path disappears automatically (because it is an ephemeral node) and the watch fires, after which all "live" brokers campaign to become the new controller (by trying to create a new controller path), but only one will succeed (this is guaranteed by ZooKeeper). The winner of the campaign becomes the new controller, and the losers re-register their watch on the new controller path. Because a ZooKeeper watch is one-shot, it becomes invalid once it fires, so it must be re-registered.
After a broker successfully becomes the new controller, the KafkaController.onControllerFailover method is triggered, and the following actions are performed in that method (a sketch of the election step itself follows the list below):

  1. Read and increment the controller epoch.
  2. Register a watch on the reassigned partitions path (/admin/reassign_partitions).
  3. Register a watch on the preferred replica election path (/admin/preferred_replica_election).
  4. Register a watch on the broker topics path (/brokers/topics) via the PartitionStateMachine.
  5. If delete.topic.enable is set to true (the default is false), the PartitionStateMachine registers a watch on the delete topics path (/admin/delete_topics).
  6. Register a watch on the broker ids path (/brokers/ids) via the ReplicaStateMachine.
  7. Initialize the ControllerContext object, which holds the current list of all topics, the list of "live" brokers, the leader and ISR of all partitions, and so on.
  8. Start the ReplicaStateMachine and the PartitionStateMachine.
  9. Set the brokerState to RunningAsController.
  10. Send the leadership information for each partition to all "live" brokers.
  11. If auto.leader.rebalance.enable is set to true (the default is true), start the partition-rebalance thread.
  12. If delete.topic.enable is set to true and the delete topics path (/admin/delete_topics) contains entries, delete the corresponding topics.
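
The election step itself is simply "whoever manages to create the ephemeral /controller node wins". A minimal sketch with the plain ZooKeeper client (not Kafka's actual leader elector, and with a simplified payload) might look like this:

    import org.apache.zookeeper.KeeperException.NodeExistsException
    import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

    // Sketch: try to become controller by creating the ephemeral /controller node.
    // Exactly one broker succeeds; the others re-register a watch on the node and wait.
    object ControllerElectionSketch {
      def tryElect(zk: ZooKeeper, brokerId: Int): Boolean = {
        try {
          val data = s"""{"brokerid":$brokerId}"""  // simplified payload
          zk.create("/controller", data.getBytes("UTF-8"),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)
          true   // this broker won the election and would now run onControllerFailover()
        } catch {
          case _: NodeExistsException =>
            zk.exists("/controller", true)  // someone else won; watch the node for the next election
            false
        }
      }
    }
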
Partition Reassignment

After the administration tool issues a partition reassignment request, the information is written to /admin/reassign_partitions, and this action triggers the ReassignedPartitionsIsrChangeListener, which performs the following actions via the callback function KafkaController.onPartitionReassignment:

  1. Update AR (Current Assigned Replicas) in ZooKeeper to OAR (Original list of replicas for the partition) + RAR (Reassigned replicas).
  2. Force an update of the leader epoch in ZooKeeper and send a LeaderAndIsrRequest to every replica in AR.
  3. Set the replicas in RAR - OAR to the NewReplica state.
  4. Wait until all replicas in RAR are in sync with their leader.
  5. Set all replicas in RAR to the OnlineReplica state.
  6. Set AR in the cache to RAR.
  7. If the leader is not in RAR, re-elect a new leader from RAR and send a LeaderAndIsrRequest. If the new leader does not need to be elected from RAR, only the leader epoch in ZooKeeper is incremented.
  8. Set all replicas in OAR - RAR to the OfflineReplica state. This consists of two parts: first, remove OAR - RAR from the ISR in ZooKeeper and send a LeaderAndIsrRequest to the leader to notify it that these replicas have been removed from its ISR; second, send a StopReplicaRequest to the replicas in OAR - RAR, thereby stopping the replicas that are no longer assigned to the partition.
  9. Set all replicas in OAR - RAR to the NonExistentReplica state, which removes them from disk.
  10. Set AR in ZooKeeper to RAR.
  11. Delete /admin/reassign_partitions.
      
    Note: The AR in ZooKeeper is updated only at the end because this is the only place where AR is persisted; if the controller crashes before this step, the new controller can still continue and complete the process.
    The following is an example of partition reassignment with OAR = {1,2,3} and RAR = {4,5,6}. The AR and the leader/ISR in the partition's ZooKeeper path change during the reassignment as follows (the set arithmetic involved is sketched after the table):
    | AR | Leader/ISR | Step |
    | ------------- | --------------- | --------------- |
    | {1,2,3} | 1/{1,2,3} | (initial state) |
    | {1,2,3,4,5,6} | 1/{1,2,3} | (step 2) |
    | {1,2,3,4,5,6} | 1/{1,2,3,4,5,6} | (step 4) |
    | {1,2,3,4,5,6} | 4/{1,2,3,4,5,6} | (step 7) |
    | {1,2,3,4,5,6} | 4/{4,5,6} | (step 8) |
    | {4,5,6} | 4/{4,5,6} | (step 10) |
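
The RAR - OAR / OAR - RAR bookkeeping above is plain set arithmetic. A tiny sketch with the example values from the table (a hypothetical helper, not controller code):

    object ReassignmentSetsSketch {
      def main(args: Array[String]): Unit = {
        val oar = Set(1, 2, 3)   // original assigned replicas
        val rar = Set(4, 5, 6)   // reassigned (target) replicas
        println(oar ++ rar)      // step 1: AR written to ZooKeeper = OAR + RAR = {1,2,3,4,5,6}
        println(rar -- oar)      // step 3: replicas to create and start = {4,5,6}
        println(oar -- rar)      // steps 8-9: replicas to stop and delete = {1,2,3}
      }
    }
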
Follower Fetching Data from the Leader

A follower fetches messages by sending a FetchRequest to the leader. The FetchRequest structure is as follows:

As can be seen from the FetchRequest structure, each fetch request specifies a maximum wait time and a minimum number of bytes to fetch, as well as a map of TopicAndPartition to PartitionFetchInfo. In fact, a follower fetching data from the leader and a consumer fetching data from a broker are both done through FetchRequests, which is why the FetchRequest structure contains a clientId field, whose default value is ConsumerConfig.DefaultClientId.
  
After the leader receives a fetch request, Kafka handles it through KafkaApis.handleFetchRequest. The process is as follows:

    1. The ReplicaManager reads the data according to the request into dataRead.
    2. If the request comes from a follower, update the follower's corresponding LEO (log end offset) and the corresponding partition's high watermark.
    3. According to dataRead, compute the length (in bytes) of the readable messages into bytesReadable.
    4. If any one of the following four conditions is met, return the corresponding data immediately:
    • The fetch request does not want to wait, i.e. fetchRequest.maxWait <= 0.
    • The fetch request does not ask for any messages, i.e. fetchRequest.numPartitions <= 0, that is, requestInfo is empty.
    • There is enough data to return, i.e. bytesReadable >= fetchRequest.minBytes.
    • An exception occurred while reading the data.
    5. If none of the above four conditions is met, the FetchRequest does not return immediately; instead, the request is wrapped as a DelayedFetch. Then check whether the DelayedFetch is already satisfied: if so, return the request; otherwise, add the request to the watch list. (The decision in step 4 is sketched below.)
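
The four conditions in step 4 amount to a single predicate. The sketch below expresses it with hypothetical field names; it is not KafkaApis' actual code.

    // Sketch of the "respond immediately?" decision in handleFetchRequest (hypothetical types and fields).
    case class FetchParamsSketch(maxWaitMs: Int, numPartitions: Int, minBytes: Int)

    object FetchDecisionSketch {
      def respondImmediately(fetch: FetchParamsSketch,
                             bytesReadable: Long,
                             errorOccurred: Boolean): Boolean =
        fetch.maxWaitMs <= 0 ||             // the request does not want to wait
        fetch.numPartitions <= 0 ||         // nothing was asked for (requestInfo is empty)
        bytesReadable >= fetch.minBytes ||  // enough data has already accumulated
        errorOccurred                       // an error occurred while reading
      // otherwise the request is wrapped as a DelayedFetch and added to the watch list
    }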

The leader returns messages to the follower in the form of a FetchResponse. The FetchResponse structure is as follows:

Replication Tools

Topic Tool

  $KAFKA_HOME/bin/kafka-topics.sh. This tool can be used to create, delete, or modify a topic, to view a topic's details, or to list all topics. In addition, the tool can modify the following per-topic configurations:

unclean.leader.election.enable, delete.retention.ms, segment.jitter.ms, retention.ms, segment.bytes, flush.messages, segment.ms, retention.bytes, cleanup.policy, segment.index.bytes, min.cleanable.dirty.ratio, max.message.bytes, file.delete.delay.ms, min.insync.replicas, index.interval.bytes
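
For example, the following commands (with hypothetical topic and retention values) create a topic with a per-topic retention override and later change that override; the flags shown are the usual kafka-topics.sh options of this Kafka generation:

    $KAFKA_HOME/bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic topic1 \
        --partitions 8 --replication-factor 3 --config retention.ms=86400000

    $KAFKA_HOME/bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic topic1 \
        --config retention.ms=43200000
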
Replica Verification Tool

  $KAFKA_HOME/bin/kafka-replica-verification.sh is used to verify that all replicas of each partition under one or more specified topics are in sync with each other. The topic-white-list parameter specifies the topics to validate and supports regular expressions.
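
A typical invocation (with a hypothetical broker list) looks like this:

    $KAFKA_HOME/bin/kafka-replica-verification.sh --broker-list localhost:9092 --topic-white-list topic1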

Preferred Replica Leader Election Tool

Use
With the replication mechanism, each partition may have multiple replicas. The replica list of a partition is called AR (Assigned Replicas), and the first replica in AR is the "preferred replica". When creating a new topic or adding partitions to an existing topic, Kafka ensures that the preferred replicas are evenly distributed across all brokers in the cluster. Ideally, the preferred replica is elected as leader. These two points together ensure that all partition leaders are distributed evenly across the cluster, which is very important because all read and write operations go through the leader: if the leaders are too concentrated, the cluster load becomes unbalanced. However, as the cluster runs, this balance can be broken by broker failures; this tool is used to help restore the balance of leader assignments.
In fact, after a broker recovers from a failure, its replicas are set to the follower role by default, unless all replicas of some partition went down and this broker holds the first replica of that partition's AR to come back. Therefore, after a partition's leader (the preferred replica) goes down and recovers, it will most likely no longer be that partition's leader, although it is still the preferred replica.
  
Principle

    1. Create the node /admin/preferred_replica_election on ZooKeeper and store in it the information about the partitions whose preferred replica needs to become the leader.
    2. The controller keeps a watch on this node. Once the node is created, the controller is notified and reads its contents.
    3. The controller reads the preferred replica. If it finds that this replica is not currently the leader and is in the partition's ISR, the controller sends a LeaderAndIsrRequest to that replica, making it the leader. If the replica is not currently the leader and is not in the ISR, the controller does not set it as leader, in order to ensure that no data is lost.
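
The data written to /admin/preferred_replica_election is a JSON list of the partitions to act on, roughly of the following form (the exact field layout is approximate and should be checked against the Kafka version in use):

    {"partitions": [{"topic": "topic1", "partition": 0}, {"topic": "topic1", "partition": 1}]}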

Usage
  $KAFKA_HOME/bin/kafka-preferred-replica-election.sh --zookeeper localhost:2181

On a Kafka cluster with 8 brokers, create a topic named topic1 with a replication factor of 3 and 8 partitions, then use the command $KAFKA_HOME/bin/kafka-topics.sh --describe --topic topic1 --zookeeper localhost:2181 to view its partition/replica distribution.

As shown in the query results, Kafka distributes all the replicas evenly across the cluster, and the leaders are also evenly distributed.


Manually stop some of the brokers; topic1's partition/replica distribution is then as shown. As can be seen, because brokers 1, 2, and 4 are stopped, the leader of partition 0 changes from 1 to 3, the leader of partition 1 from 2 to 5, the leader of partition 2 from 3 to 6, and the leader of partition 3 from 4 to 7.


Then restart the broker with ID 1; topic1's partition/replica distribution is as follows. As can be seen, although broker 1 has been started (1 appears in the ISR of partition 0 and partition 5), it is not the leader of any partition, while brokers 5, 6, and 7 are each the leader of 2 partitions. In other words, the leader distribution is unbalanced: a broker is the leader of at most 2 partitions and of as few as 0 partitions.


After running the tool, topic1's partition/replica distribution is as shown. As can be seen from the figure, except for partition 1 and partition 3, whose leaders are not their preferred replicas because broker 2 and broker 4 have not yet started, the leaders of all other partitions are their preferred replicas. At the same time, the leaders are distributed more evenly than before running the tool: a broker is the leader of at most 2 partitions and of at least 1 partition.


Start broker 2 and broker 4. The leader distribution does not change compared with the previous step, as shown.

Run the tool again: now all partition leaders are their preferred replicas, and the leader distribution is even, with each broker acting as the leader of exactly 1 partition.
  
In addition to running the tool manually to distribute the leaders evenly, Kafka also provides automatic leader balancing, which is enabled by setting auto.leader.rebalance.enable to true. The controller then periodically checks whether the leader allocation is balanced; if the imbalance exceeds a certain threshold, the controller automatically attempts to set the leader of each partition back to its preferred replica. The check interval is specified by leader.imbalance.check.interval.seconds, and the imbalance threshold by leader.imbalance.per.broker.percentage.

Kafka Reassign Partitions Tool

Use
This tool is similar in purpose to the Preferred Replica Leader Election Tool: both are designed to facilitate load balancing of a Kafka cluster. The difference is that the Preferred Replica Leader Election Tool can only adjust which replica within a partition's AR becomes the leader, so that the leaders are evenly distributed, whereas this tool can also adjust the partition's AR itself.
A follower needs to fetch data from the leader to stay in sync with it, so merely keeping the leader distribution balanced is not enough to balance the load of the entire cluster. In addition, in a production environment the Kafka cluster may need to be expanded as the load grows. Adding a broker to a Kafka cluster is simple and convenient, but for existing topics, their partitions are not automatically migrated to the newly added broker; this tool can be used to do so. In some scenarios, the actual load may turn out to be much smaller than initially expected, and this tool can be used to concentrate the partitions of the whole cluster onto a subset of the machines, after which the unneeded brokers can be stopped to save resources.
It should be noted that this tool can not only adjust the placement of a partition's AR, but also its size, i.e. change the topic's replication factor.
  
Principle
The tool is only responsible for writing the required information into the corresponding node in ZooKeeper and then exiting; it does not perform the actual operations itself. All the adjustments are carried out by the controller, as follows:

    1. Create the node /admin/reassign_partitions on ZooKeeper and store in it the list of target partitions and their corresponding target AR lists.
    2. The watch registered by the controller on /admin/reassign_partitions fires, and the controller reads the list.
    3. For each partition in the list, the controller does the following:
    • Start the replicas in RAR - AR, i.e. the newly assigned replicas (RAR = Reassigned Replicas, AR = Assigned Replicas).
    • Wait for the new replicas to sync with the leader.
    • If the leader is not in RAR, elect a new leader from RAR.
    • Stop and delete the replicas in AR - RAR, i.e. the replicas that are no longer needed.
    • Delete the /admin/reassign_partitions node.

Usage
The tool has three modes of use:

    • generate mode: given the topics that need to be reassigned, automatically generates a reassign plan (without executing it).
    • execute mode: reassigns the partitions according to the specified reassign plan.
    • verify mode: verifies whether the partition reassignment succeeded.

The following example uses this tool to reassign all partitions of topic1 to brokers 4/5/6/7, as follows:

    1. Use generate mode to generate a reassign plan. Write the topics to be reassigned into a JSON file, e.g. /tmp/topics-to-move.json with the content {"topics": [{"topic": "topic1"}], "version": 1}, then execute
       $KAFKA_HOME/bin/kafka-reassign-partitions.sh  --zookeeper localhost:2181  --topics-to-move-json-file /tmp/topics-to-move.json   --broker-list "4,5,6,7" --generate

The result is shown below.
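
The generated output is JSON of roughly the following form (the partition-to-replica mapping below is invented for illustration); the tool also prints the current assignment alongside it so that it can be saved for rollback.

    {"version": 1,
     "partitions": [{"topic": "topic1", "partition": 0, "replicas": [4,5,6]},
                    {"topic": "topic1", "partition": 1, "replicas": [5,6,7]}]}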

  
2. Use execute mode to execute the reassign plan.
Save the reassignment plan generated in the previous step to the file /tmp/reassign-plan.json, then execute

    $KAFKA_HOME/bin/kafka-reassign-partitions.sh     --zookeeper localhost:2181         --reassignment-json-file /tmp/reassign-plan.json --execute

At this point, the node /admin/reassign_partitions has been created on ZooKeeper, and its value is consistent with the contents of the /tmp/reassign-plan.json file.

3. Use verify mode to verify that the reassignment has completed. Execute the verify command

   $KAFKA_HOME/bin/kafka-reassign-partitions.sh    --zookeeper localhost:2181 --verify   --reassignment-json-file /tmp/reassign-plan.json

The results are as follows; as can be seen, all partitions of topic1 were reassigned successfully.

Next use the topic tool to verify again.

    bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic topic1

As shown, all partitions of topic1 have been reassigned to brokers 4/5/6/7, and each partition's AR is consistent with the reassign plan.

It is important to note that before using execute mode, it is not mandatory to generate the reassign plan with generate mode; generate mode is only provided for convenience. In fact, in some scenarios the reassign plan produced by generate mode does not necessarily meet the requirements, in which case the user can write the reassign plan by hand.

State Change Log Merge Tool

Use
This tool collects the state-change logs from the brokers across the entire cluster and produces a single, centrally formatted log to help diagnose failures related to state changes. Each broker stores the state-change instructions it receives in a log file named state-change.log. In some cases, a partition's leader election may be problematic, and a global view of the state changes of the entire cluster is needed to diagnose and resolve the problem. This tool merges the state-change.log files in the cluster in chronological order, supports a user-specified time range and target topics and partitions as filter conditions, and finally outputs the formatted result.
  
Usage

    bin/kafka-run-class.sh kafka.tools.StateChangeLogMerger --logs /opt/kafka_2.11-0.8.2.1/logs/state-change.log --topic topic1 --partitions 0,1,2,3,4,5,6,7

