Kafka Design Analysis: Kafka High Availability

Source: Internet
Author: User
Tags: error code, failover, RAR, zookeeper
Questions Guide
1. How are topics created and deleted?
2. What steps make up the broker's request-handling process?
3. How does a broker respond to a LeaderAndIsrRequest?

This article is a repost of the original: http://www.jasongj.com/2015/06/08/KafkaColumn3
Building on the previous article, this post explains Kafka's HA mechanism in detail, walking through HA-related scenarios such as broker failover, controller failover, topic creation/deletion, broker startup, and how a follower fetches data from the leader. It also introduces the replication-related tools Kafka provides, such as the partition reassignment tool.
Broker Failover Process

The controller handles a broker failure as follows:
1. The controller registers a watch on ZooKeeper's /brokers/ids node. Once a broker goes down (this article uses "down" for any scenario in which Kafka considers a broker dead, including but not limited to machine power loss, network unavailability, a GC-induced stop-the-world pause, or a process crash), the corresponding ephemeral znode in ZooKeeper is deleted automatically, ZooKeeper fires the watch the controller registered, and the controller obtains the latest list of surviving brokers.
2. The controller computes set_p, the set of all partitions that have a replica on the failed broker(s).
3. For each partition in set_p:
   3.1 Read the partition's current ISR from /brokers/topics/[topic]/partitions/[partition]/state.
   3.2 Decide the partition's new leader. If at least one replica in the current ISR is still alive, pick one of them as the new leader, and let the new ISR contain all surviving replicas of the current ISR. Otherwise, pick any surviving replica of the partition as the new leader and ISR (data loss is possible in this case). If every replica of the partition is down, set the new leader to -1.
   3.3 Write the new leader, ISR, leader_epoch and controller_epoch to /brokers/topics/[topic]/partitions/[partition]/state. This write only takes effect if the controller's version has not changed between steps 3.1 and 3.3; otherwise, jump back to 3.1.
4. Send a LeaderAndIsrRequest directly via RPC to the brokers affected by set_p. The controller can batch multiple commands into one RPC operation to increase efficiency.
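The leader-election rule in step 3.2 can be restated as a minimal Python sketch. This is not Kafka's actual Scala implementation; the function name and signature are hypothetical, chosen only to illustrate the decision order (clean election from the surviving ISR first, unclean election from any surviving replica second, -1 last).

```python
def elect_leader(assigned_replicas, isr, live_brokers):
    """Return (new_leader, new_isr) for one partition after a broker failure.

    assigned_replicas: ordered list of broker IDs assigned to the partition
    isr: the partition's current in-sync replica list
    live_brokers: set of broker IDs still alive
    """
    surviving_isr = [r for r in isr if r in live_brokers]
    if surviving_isr:
        # Clean election: new ISR keeps every surviving member of the old ISR.
        return surviving_isr[0], surviving_isr
    surviving = [r for r in assigned_replicas if r in live_brokers]
    if surviving:
        # Unclean election: the chosen replica may lag the old leader,
        # so data loss is possible in this branch.
        return surviving[0], [surviving[0]]
    # Every replica is down: mark the partition leaderless.
    return -1, []
```

For example, with AR = [1,2,3], ISR = [1,2] and broker 1 down, broker 2 is elected cleanly; if only broker 3 survives, it is elected uncleanly.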
The Broker failover sequence diagram is shown below.

The LeaderAndIsrRequest structure is as follows:
The LeaderAndIsrResponse structure is as follows:


Create/delete Topic

The controller registers a watch on ZooKeeper's /brokers/topics node. When a topic is created or deleted, the controller learns the partition/replica assignment of the newly created or deleted topic through that watch.
For a delete-topic operation, the topic tool stores the topic name under /admin/delete_topics. If delete.topic.enable is true, the watch the controller registered on /admin/delete_topics fires, and the controller sends a StopReplicaRequest to the corresponding brokers via a callback. If it is false, the controller does not register a watch on /admin/delete_topics and therefore does not react to the event; in that case the delete-topic operation is only recorded, not executed.
For a create-topic operation, the controller reads the list of all currently available brokers from /brokers/ids, then, for each partition in set_p:
3.1 From all replicas assigned to the partition (called AR), choose one on an available broker as the new leader, and set AR as the new ISR (because the topic is newly created, every replica in AR holds no data, so all of them can be considered in sync, i.e. all in the ISR, and any one of them can serve as leader).
3.2 Write the new leader and ISR to /brokers/topics/[topic]/partitions/[partition], then send a LeaderAndIsrRequest to the relevant brokers directly via RPC.
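Step 3.1 for a new topic is much simpler than failover election, because no replica holds data yet. A minimal sketch (hypothetical function, not Kafka's actual code) of that rule:

```python
def assign_new_partition_leader(assigned_replicas, live_brokers):
    """For a newly created partition, every replica is empty, so any live
    replica can lead. Pick the first live one from AR and use the full AR
    as the initial ISR."""
    for replica in assigned_replicas:
        if replica in live_brokers:
            return replica, list(assigned_replicas)
    raise RuntimeError("no live replica available for the new partition")
```

With AR = [3,1,2] and all brokers alive, broker 3 becomes leader and the ISR is the whole AR; if broker 3 is down, the next live replica (broker 1) leads instead.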
The Create topic sequence diagram is shown below.


Broker Response Request Process

The broker accepts and responds to various requests through kafka.network.SocketServer and related modules. The whole network module is built on Java NIO and uses the reactor pattern: 1 Acceptor accepts client connections, N Processors read and write data, and M Handlers process business logic.
The Acceptor's primary responsibility is to listen for and accept connection requests from clients (request initiators, including but not limited to producers, consumers, the controller, and admin tools) and establish the data channel with the client. It then hands the client off to a Processor; that ends the Acceptor's work for this client, and it can go on to accept the next connection request. The core code is as follows.



  
The Processor is mainly responsible for reading data from clients and returning responses to them; it does not handle business logic itself. It maintains an internal queue holding all the SocketChannels assigned to it. The Processor's run method repeatedly takes a new SocketChannel from this queue, registers its SelectionKey.OP_READ with the selector, and then loops over channels that are ready for reading (requests) and writing (responses). After reading data, the Processor wraps it in a Request object and hands it to the RequestChannel.
The RequestChannel is where Processors and KafkaRequestHandlers exchange data. It contains a queue, requestQueue, in which Processors store requests and from which KafkaRequestHandlers remove them; it also contains a responseQueue, used to store the responses that KafkaRequestHandlers return to clients after requests have been processed.
The Processor retrieves responses from the responseQueue kept in the RequestChannel via its processNewResponses method and registers the corresponding SelectionKey.OP_WRITE event on the selector. When the selector's select method returns and a writable channel is detected, the write method is called to return the response to the client.
The KafkaRequestHandler repeatedly takes requests from the RequestChannel and hands them to kafka.server.KafkaApis to handle the specific business logic.
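The queue-based hand-off between Processors and KafkaRequestHandlers can be modeled in a few lines. This is a toy Python model of the RequestChannel idea, not Kafka's Scala class; the method names here are illustrative assumptions.

```python
import queue

class RequestChannel:
    """Toy model of kafka.network.RequestChannel: all processors push
    requests into one shared queue; handlers push each response back to
    the response queue of the processor that owns the client connection."""

    def __init__(self, num_processors):
        self.request_queue = queue.Queue()
        self.response_queues = [queue.Queue() for _ in range(num_processors)]

    def send_request(self, processor_id, request):
        # Called by a Processor after reading a request from its socket.
        self.request_queue.put((processor_id, request))

    def receive_request(self):
        # Called by a KafkaRequestHandler; blocks until a request arrives.
        return self.request_queue.get()

    def send_response(self, processor_id, response):
        # Called by a handler; the owning Processor will write this back.
        self.response_queues[processor_id].put(response)

def handler_loop_once(channel):
    """One iteration of a handler: take a request, 'process' it, respond."""
    processor_id, request = channel.receive_request()
    channel.send_response(processor_id, "response to " + request)
```

The key design point this models is decoupling: Processors never block on business logic, and handlers never touch sockets.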

LeaderAndIsrRequest Response Process

A broker handles a received LeaderAndIsrRequest mainly through the ReplicaManager's becomeLeaderOrFollower method, as follows:
1. If the controllerEpoch in the request is less than the current controllerEpoch, return ErrorMapping.StaleControllerEpochCode directly.
2. For each element ((topic, partitionId), partitionStateInfo) in the request's partitionStateInfos:
   2.1 If the leader epoch in the partitionStateInfo is greater than the leader epoch of partition (topic, partitionId) stored in the current ReplicaManager, then:
      2.1.1 If the current broker's ID (i.e. replica ID) appears in the partitionStateInfo, store the partition and its partitionStateInfo in a HashMap named partitionState.
      2.1.2 Otherwise, the broker is not in the partition's assigned replica list; record this in the log.
   2.2 Otherwise, store the corresponding error code (ErrorMapping.StaleLeaderEpochCode) in the response.
3. Put all records in partitionState whose leader equals the current broker's ID into partitionsToBeLeader, and the remaining records into partitionsToBeFollower.
4. If partitionsToBeLeader is not empty, execute the makeLeaders method on it.
5. If partitionsToBeFollower is not empty, execute the makeFollowers method on it.
6. If the highwatermark checkpoint thread has not been started, start it and set hwThreadInitialized to true.
7. Shut down all idle fetcher threads.
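The epoch checks and the leader/follower split above can be condensed into a small sketch. This is a simplified Python restatement of the control flow, not the real becomeLeaderOrFollower; the request layout and names below are assumptions for illustration.

```python
STALE_CONTROLLER_EPOCH = "StaleControllerEpochCode"

def become_leader_or_follower(request, current_controller_epoch,
                              current_leader_epochs, broker_id):
    """Toy version of ReplicaManager.becomeLeaderOrFollower's filtering.

    request: {"controller_epoch": int,
              "partitions": {name: {"leader": id, "leader_epoch": int,
                                    "replicas": [ids]}}}
    current_leader_epochs: {partition_name: last seen leader epoch}
    Returns (error, partitions_to_be_leader, partitions_to_be_follower).
    """
    # Step 1: reject commands from a stale (superseded) controller.
    if request["controller_epoch"] < current_controller_epoch:
        return STALE_CONTROLLER_EPOCH, [], []
    to_be_leader, to_be_follower = [], []
    for partition, state in request["partitions"].items():
        # Step 2.2: stale leader epoch -> skip (error recorded per partition).
        if state["leader_epoch"] <= current_leader_epochs.get(partition, -1):
            continue
        # Step 2.1.2: this broker is not in the partition's replica list.
        if broker_id not in state["replicas"]:
            continue
        # Step 3: split by whether this broker is the designated leader.
        if state["leader"] == broker_id:
            to_be_leader.append(partition)
        else:
            to_be_follower.append(partition)
    return None, to_be_leader, to_be_follower
```

The real implementation then runs makeLeaders / makeFollowers on the two resulting sets.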
The LeaderAndIsrRequest process is shown in the following figure.



Broker startup Process

After a broker starts, it creates an ephemeral node under ZooKeeper's /brokers/ids znode based on its ID. The creation fires the broker-change watch that the controller's ReplicaStateMachine registered, and the callback KafkaController.onBrokerStartup then completes the following steps:
1. Send an UpdateMetadataRequest to all newly started brokers; its definition is as follows.
2. Set all replicas on the newly started brokers to the OnlineReplica state, and have those brokers start the high-watermark threads for the corresponding partitions.
3. Trigger onlinePartitionStateChange via the PartitionStateMachine.


Controller Failover

The controller also needs failover. Every broker registers a watch on the controller path (/controller). When the current controller fails, the corresponding controller path disappears automatically (it is an ephemeral node), the watch fires, and all "live" brokers campaign to become the new controller by trying to create a new controller path, but only one succeeds (this is guaranteed by ZooKeeper). The winner becomes the new controller; the losers re-register their watch on the new controller path. A ZooKeeper watch is one-shot, invalidated once it fires, so it must be re-registered.
After a broker successfully becomes the new controller, the KafkaController.onControllerFailover method is triggered, which performs the following actions:
1. Read and increment the controller epoch.
2. Register a watch on the reassigned-partitions path (/admin/reassign_partitions).
3. Register a watch on the preferred-replica-election path (/admin/preferred_replica_election).
4. Register a watch on the broker topics path (/brokers/topics) via the PartitionStateMachine.
5. If delete.topic.enable is true (the default is false), the PartitionStateMachine registers a watch on the delete-topic path (/admin/delete_topics).
6. Register a watch on the broker ids path (/brokers/ids) via the ReplicaStateMachine.
7. Initialize the ControllerContext object, which holds the current list of all topics, the list of "live" brokers, the leader and ISR of every partition, and so on.
8. Start the ReplicaStateMachine and the PartitionStateMachine.
9. Set the brokerState to RunningAsController.
10. Send each partition's leadership information to all "live" brokers.
11. If auto.leader.rebalance.enable is true (the default is true), start the partition-rebalance thread.
12. If delete.topic.enable is true and the delete-topic path (/admin/delete_topics) contains entries, delete the corresponding topics.
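The "exactly one winner" property comes from ZooKeeper's create-if-absent semantics on the ephemeral /controller node. The sketch below fakes that semantics with an in-memory dict purely to illustrate the race; the real guarantee is provided by ZooKeeper, not by application code.

```python
class FakeZooKeeper:
    """Stand-in for ZooKeeper's atomic create-if-absent on a znode."""

    def __init__(self):
        self.nodes = {}

    def try_create(self, path, value):
        # In real ZooKeeper, create() on an existing path raises
        # NodeExistsException; we return False instead.
        if path in self.nodes:
            return False
        self.nodes[path] = value
        return True

def elect_controller(zk, broker_ids):
    """Every live broker races to create /controller; exactly one succeeds.
    Losers must re-register their watch on the new controller path,
    because a fired ZooKeeper watch is one-shot."""
    winners = [b for b in broker_ids if zk.try_create("/controller", b)]
    return winners[0] if winners else None
```

Here the first broker to reach try_create wins; in a real cluster the winner is simply whichever broker's create request reaches ZooKeeper first.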

Partition re-allocation

After the management tool issues a partition reassignment request, the information is written to /admin/reassign_partitions, and this action triggers the ReassignedPartitionsIsrChangeListener, which runs through the callback KafkaController.onPartitionReassignment:
1. Update AR (current Assigned Replicas) in ZooKeeper to OAR (Original list of replicas for the partition) + RAR (Reassigned replicas).
2. Force-update the leader epoch in ZooKeeper and send a LeaderAndIsrRequest to every replica in AR.
3. Set the replicas in RAR - OAR to the NewReplica state.
4. Wait until every replica in RAR is in sync with its leader.
5. Set all replicas in RAR to the OnlineReplica state.
6. Set the AR in the cache to RAR.
7. If the leader is not in RAR, re-elect a new leader from RAR and send a LeaderAndIsrRequest. If the new leader was not elected from RAR, the leader epoch in ZooKeeper is also incremented.
8. Set all replicas in OAR - RAR to the OfflineReplica state. This has two parts: first, remove OAR - RAR from the ISR in ZooKeeper and send a LeaderAndIsrRequest to the leader to notify it that these replicas have been removed from the ISR; second, send a StopReplicaRequest to the replicas in OAR - RAR, stopping replicas that are no longer assigned to the partition.
9. Set all replicas in OAR - RAR to the NonExistentReplica state, removing them from disk.
10. Set the AR in ZooKeeper to RAR.
11. Delete /admin/reassign_partitions.
  
Note: updating the AR in ZooKeeper comes last because it is the only place where AR is persisted; if the controller crashes before this step, the new controller can still continue and complete the process.
The following is an example of partition reassignment with OAR = {1,2,3} and RAR = {4,5,6}. The partition's AR and leader/ISR in ZooKeeper change during the reassignment as follows:
| AR | Leader/ISR | Step |
| --- | --- | --- |
| {1,2,3} | 1/{1,2,3} | (initial state) |
| {1,2,3,4,5,6} | 1/{1,2,3} | (step 2) |
| {1,2,3,4,5,6} | 1/{1,2,3,4,5,6} | (step 4) |
| {1,2,3,4,5,6} | 4/{1,2,3,4,5,6} | (step 7) |
| {1,2,3,4,5,6} | 4/{4,5,6} | (step 8) |
| {4,5,6} | 4/{4,5,6} | (step 10) |
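The state transitions in the table can be replayed with a small simulation. This is a toy model of the ZooKeeper-visible states only (the real work happens in KafkaController.onPartitionReassignment); the function and its step mapping are illustrative assumptions.

```python
def reassignment_states(oar, rar, leader):
    """Replay the AR / leader / ISR snapshots of a partition reassignment,
    in the order the relevant steps change ZooKeeper state."""
    states = []
    ar, isr = list(oar), list(oar)                  # initial state
    states.append((list(ar), leader, list(isr)))
    ar = oar + [r for r in rar if r not in oar]     # step 2: AR = OAR + RAR
    states.append((list(ar), leader, list(isr)))
    isr = list(ar)                                  # step 4: RAR caught up
    states.append((list(ar), leader, list(isr)))
    if leader not in rar:                           # step 7: leader must be
        leader = rar[0]                             #   re-elected from RAR
    states.append((list(ar), leader, list(isr)))
    isr = [r for r in isr if r in rar]              # step 8: drop OAR - RAR
    states.append((list(ar), leader, list(isr)))
    ar = list(rar)                                  # step 10: AR = RAR
    states.append((list(ar), leader, list(isr)))
    return states
```

Running it with OAR = [1,2,3], RAR = [4,5,6] and initial leader 1 reproduces exactly the six rows of the table above.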

Follower fetch data from leader

The follower fetches messages by sending a FetchRequest to the leader. The FetchRequest structure is as follows:

As the FetchRequest structure shows, each fetch request specifies a maximum wait time and a minimum number of fetch bytes, as well as a map from TopicAndPartition to PartitionFetchInfo. In fact, a follower fetching data from the leader and a consumer fetching data from a broker both go through FetchRequest; that is why the FetchRequest structure contains a clientId field, whose default value is ConsumerConfig.DefaultClientId.
  
After the leader receives a fetch request, Kafka handles it through KafkaApis.handleFetchRequest. The response process is as follows:
1. The ReplicaManager reads the data according to the request into dataRead.
2. If the request came from a follower, update the follower's LEO (log end offset) and the corresponding partition's high watermark.
3. Compute the length (in bytes) of the readable messages from dataRead into bytesReadable.
4. If any 1 of the following 4 conditions is met, return the corresponding data immediately:
   - The fetch request does not want to wait, i.e. fetchRequest.maxWait <= 0.
   - The fetch request does not require any messages to be fetched, i.e. fetchRequest.numPartitions <= 0, meaning requestInfo is empty.
   - There is already enough data to return, i.e. bytesReadable >= fetchRequest.minBytes.
   - An exception occurred while reading the data.
5. If none of the 4 conditions is met, the FetchRequest does not return immediately; it is wrapped as a DelayedFetch. Check whether the DelayedFetch is already satisfied; if so, return it, otherwise add the request to the watch list.
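The four immediate-return conditions reduce to one boolean expression. A minimal sketch (hypothetical helper, not Kafka's actual code):

```python
def should_return_immediately(max_wait_ms, num_partitions,
                              bytes_readable, min_bytes, error_occurred):
    """True if the leader should answer a FetchRequest at once instead of
    parking it as a DelayedFetch."""
    return (max_wait_ms <= 0                # caller does not want to wait
            or num_partitions <= 0          # empty requestInfo
            or bytes_readable >= min_bytes  # enough data accumulated already
            or error_occurred)              # exception while reading data
```

For example, a follower fetch with maxWait=500, minBytes=100 and only 50 readable bytes is parked as a DelayedFetch until more data arrives or the wait expires.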
The leader returns messages to the follower in the form of a FetchResponse. The FetchResponse structure is as follows:

Replication Tools

The topic tool, $KAFKA_HOME/bin/kafka-topics.sh, can be used to create, delete, modify, and view a topic, or to list all topics. In addition, the tool can modify the following configurations.
