Apache Kafka's life and death: the failover mechanism


Translated from: http://www.cnblogs.com/fxjwind/p/4972244.html

As high-throughput message middleware, Kafka's performance, simplicity, and stability have made it a mainstream choice in current real-time stream processing frameworks.

Of course, using Kafka also brings many problems, especially around failover, which often causes a great deal of trouble.
So, by combing through the Kafka source code, this article tries to explain, in as accessible a way as possible, the mechanisms at work when Kafka fails over, so that everyone knows what is going on during use and operations.

If you are not familiar with Kafka, you can first refer to https://kafka.apache.org/08/design.html to get a general picture.

0 Background

The premise for discussing Kafka's failover here is the replica mechanism that Kafka provides from release 0.8 onward.
Version 0.7 has no failover at all: any dead broker makes the data on it unreadable, resulting in a service outage.

The following is a brief introduction to the replica mechanism and the corresponding components added in 0.8.

Replica mechanism

The basic idea is similar to other replicated systems, as in the figure below (Ref. 2):

The figure shows 4 Kafka brokers. Topic1 has four partitions (shown in blue), whose leader replicas are distributed across the 4 brokers;
each partition also has two follower replicas (shown in orange), placed on brokers different from the leader's.
The allocation algorithm is very simple; if you are interested, refer to the design of Kafka.
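As a rough illustration of that allocation, here is a simplified round-robin sketch in Python. This is not Kafka's exact algorithm (the real one also randomizes the starting broker and shifts followers); it only shows the key property that a partition's replicas land on distinct brokers.

```python
# Simplified sketch of round-robin replica assignment (not Kafka's exact
# algorithm): partition p's leader goes to broker (p % n), and each follower
# is placed on the next broker in the ring, so no two replicas of the same
# partition share a broker.
def assign_replicas(num_partitions, num_brokers, replication_factor):
    assert replication_factor <= num_brokers
    assignment = {}
    for p in range(num_partitions):
        assignment[p] = [(p + r) % num_brokers for r in range(replication_factor)]
    return assignment

# 4 partitions on 4 brokers, 3 replicas each (1 leader + 2 followers)
layout = assign_replicas(4, 4, 3)
print(layout)  # {0: [0, 1, 2], 1: [1, 2, 3], 2: [2, 3, 0], 3: [3, 0, 1]}
```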

Replica components

To support the replica mechanism, two main components were added, the Replica Manager and the Controller:

Replica Manager

Each broker server creates a Replica Manager, and all data reads and writes pass through it.
In version 0.7, Kafka read data directly through the LogManager, but after the replica mechanism was added, only the leader replica may respond to read and write requests.
Therefore, the Replica Manager needs to manage the replica state of every partition, respond to read and write requests, and handle other replica-related actions.

Controller

As you can see, each partition has one leader replica and a number of follower replicas, so who decides who is the leader?
You might say there is ZooKeeper, but running a ZK election for every partition is inefficient, and ZK would be overwhelmed;
so the common practice now is to use ZK only to elect a single master node, and then let that master node do all the other arbitration work.
Kafka's approach is to elect one of the brokers as the controller, which then arbitrates all partition leader elections.

Below we will explain the failover mechanism from the following angles:
first, the data consistency problem from the client's point of view when Kafka fails over;
then the impact of a failover of each of Kafka's important components: ZooKeeper, broker, and controller;
finally, some tips for judging Kafka's state.

1 From the client's point of view

From the producer's perspective, will data be lost?

Besides enabling the replica mechanism, this also depends on the producer's request.required.acks setting:

    • acks = 0: fire and forget; no ack is needed, whether the write succeeds or not;
    • acks = 1: return as soon as the write to the leader replica succeeds; the other replicas update asynchronously through the fetcher, so there is a risk of data loss: if the leader dies before its data is synchronized, that data is lost;
    • acks = -1: return only after all replicas have succeeded; this pure synchronous write has relatively high latency.

So in the general case, with throughput as the priority, set acks to 1; in extreme cases it is possible to lose data.
If you can accept a longer write latency, you can set acks to -1 instead.
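To make the three settings concrete, here is a toy Python model of when the ack is returned relative to replication. The function and variable names are hypothetical (this is not the Kafka client API); it only illustrates why acks=1 can lose data on a leader switch.

```python
# Toy model of request.required.acks semantics (hypothetical names, not the
# Kafka client API). It shows when the ack happens relative to replication.
def send(message, acks, leader_log, follower_logs):
    """Append to the leader; return True once the configured acks are met."""
    leader_log.append(message)
    if acks in (0, 1):
        # acks=0: fire and forget; acks=1: acked after the leader write only.
        # Followers catch up asynchronously, so they may still miss it.
        return True
    if acks == -1:
        # Acked only after every follower has replicated the message.
        for log in follower_logs:
            log.append(message)
        return True

leader, f1, f2 = [], [], []
send("m1", acks=-1, leader_log=leader, follower_logs=[f1, f2])  # fully replicated
send("m2", acks=1, leader_log=leader, follower_logs=[f1, f2])   # leader only
# If the leader dies now and f1 is elected leader, "m2" is lost even though
# the producer already received an ack for it:
print("m2" in f1)  # False
```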

From the consumer's perspective, can inconsistent data be read?

First, whether you use the high-level or the low-level consumer, you need to know how it reads data from Kafka.

Kafka's log partitions are stored as files and indexed by offset, so the consumer needs to record the last-read offset for each partition (with the high-level consumer Kafka remembers it for you; with the low-level consumer you remember it yourself).

So if a consumer dies, after a reboot it only needs to continue reading from the last offset, and there will be no inconsistency.

But if a Kafka broker dies and a partition leader switch occurs, how do we ensure that this offset is still valid on the new leader?
Kafka uses a mechanism called the committed offset to ensure this consistency, as in the figure below (Ref. 2).

Besides the log end offset, which marks the end of the log, there is also a committed offset, which marks the highest valid offset;
the committed offset is advanced to an offset only after all replicas have synchronized up to it.
So in the figure the committed offset is 2, because the replica on broker 3 has not yet completed the synchronization of offset 3;
at this point the message at offset 3 is not visible to the consumer, which can read at most up to offset 2.
If the leader dies now, no matter which follower is re-elected leader, data consistency is unaffected, because the offsets visible to the consumer go at most up to 2, and those offsets are consistent on all the replicas.

So under normal circumstances, when Kafka fails over, the consumer will not read inconsistent data. The extreme case is when the current leader is the only valid replica and the other replicas have fallen far behind: if a leader switch happens then, data is bound to be lost and offset inconsistency will occur.
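The committed-offset rule above can be sketched in a few lines. This is a simplification (real Kafka tracks this per partition as the "high watermark"), but the arithmetic matches the figure: with replicas at log end offsets 4, 4, and 3, only offsets up to 2 are consumer-visible.

```python
# Sketch of the committed-offset (high watermark) rule: a message becomes
# visible to consumers only once every in-sync replica has it, so the
# committed offset is the minimum log end offset across replicas, minus one.
def committed_offset(log_end_offsets):
    """log_end_offsets: the next offset to be written on each replica."""
    return min(log_end_offsets) - 1

# Leader and one follower have offsets 0..3 written (LEO = 4), but the
# replica on broker 3 has only synced offsets 0..2 (LEO = 3).
leo = {"broker1 (leader)": 4, "broker2": 4, "broker3": 3}
print(committed_offset(leo.values()))  # 2, so offset 3 is not yet visible
```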

2 Zookeeper Failover

Kafka depends strongly on ZooKeeper, so how is the data affected when something goes wrong with ZooKeeper?

Zookeeper Dead

If ZooKeeper is dead, the broker cannot start, and reports an exception saying it cannot connect to ZooKeeper.
This may mean ZooKeeper is dead, or that the network is unreachable; in short, the broker cannot reach ZooKeeper.
In this case, Kafka does not work at all until it can connect to ZooKeeper again.

Zookeeper Hang

The situation above is actually relatively simple; the more troublesome case is a ZooKeeper hang. You could say that more than 80% of Kafka problems are due to this.
ZooKeeper hangs for many reasons, mainly ZK overload, or the ZK host running short of CPU, memory, or network resources.

The main problem with a ZooKeeper hang is the session timeout, which triggers the following problems:

a. Controller failure: the controller is re-elected and switched over; see the specific process below.

b. Broker failure: this causes partition leaders to switch, or partitions to go offline; see the process below.

c. Broker stuck.
This is a special case; when it appears, the following log shows up in server.log:

server.log:
"INFO I wrote this conflicted ephemeral node [{"jmx_port":9999,"timestamp":"144470963049","host":"10.151.4.136","version":1,"port":9092}] at /brokers/ids/1 a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)"

The problem itself is due to a bug in ZooKeeper; refer to https://issues.apache.org/jira/browse/ZOOKEEPER-1740.

The issue states: "The current behavior of zookeeper for ephemeral nodes is that session expiration and ephemeral node deletion is not an atomic operation."
That is, ZK's session expiration and the deletion of the ephemeral node are not an atomic operation,
so the following case can appear:

    • In an extreme case, ZK triggers the session timeout, but before it can finish deleting the /brokers/ids/1 node it hangs, for example because it is doing a lot of fsync operations.
    • However, after broker 1 receives the session timeout event, it tries to re-create the /brokers/ids/1 node in ZK; but the old node still exists, so it gets NodeExists. This is actually unreasonable: since the session timed out, this node should not exist.
    • The usual practice would be: since the node already exists, never mind, just carry on. The problem is that once ZK recovers from the fsync hang, it remembers there is a node it has not yet deleted, and goes and deletes /brokers/ids/1.
    • The result is that, for the client, the /brokers/ids/1 node no longer exists, even though it never received another session-expired event.

So the way it is handled here is: when NodeExists is found, loop and wait until ZK recovers from the hang and deletes the stale node, then create the new node successfully, and only then count the operation as complete.
The consequence is that the broker is also stuck here, waiting for the node to be created successfully.
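The wait-and-retry handling described above can be sketched as follows. The helper names here are hypothetical (this is not Kafka's actual ZkUtils code); the point is simply that the broker blocks in a retry loop until ZooKeeper finally deletes the stale ephemeral node.

```python
import time

class NodeExistsError(Exception):
    """Stand-in for ZooKeeper's NodeExists error code."""

# Sketch of the workaround (hypothetical names, not Kafka's ZkUtils code):
# on NodeExists after a session timeout, back off and retry until ZooKeeper
# deletes the stale node from the expired session. Session expiration and
# node deletion are not atomic (ZOOKEEPER-1740), so the stale node can
# linger; the broker appears "stuck" in this loop while ZK is hung.
def create_ephemeral_with_retry(create_fn, backoff_s=0.01):
    attempts = 0
    while True:
        attempts += 1
        try:
            create_fn()          # e.g. create /brokers/ids/1 ephemerally
            return attempts      # created under the new session; done
        except NodeExistsError:
            time.sleep(backoff_s)

# Simulate ZK deleting the stale node only before the third attempt.
state = {"stale": 2}
def fake_create():
    if state["stale"] > 0:
        state["stale"] -= 1
        raise NodeExistsError()

tries = create_ephemeral_with_retry(fake_create)
print(tries)  # 3
```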

3 Broker Failover

A broker's failover can be divided into two processes: broker failure and broker startup.

Adding a new broker

Before we talk about failure, let's look at a simpler process first: adding a new broker.
First, be clear that a new broker has no effect on any existing topics and partitions,
because the replica assignment of a topic's partitions is determined at creation time and does not change automatically unless you manually do a reassignment.
So all a new broker needs to do is sync the metadata; everyone then knows about it, and the new broker will be used when you create a new topic or partition.

Broker Failure

First, be clear that a broker failure here does not necessarily mean the broker server is really dead; it means that the broker's corresponding ZK ephemeral node, such as /brokers/ids/1, has had a session timeout.
Of course, besides the server being dead, there are many other possible causes, such as the network being unreachable; but we do not care about the cause: as long as the session times out, we consider the broker no longer working.
The following log will appear:

controller.log:
"INFO [BrokerChangeListener on Controller 1]: Newly added brokers: 3, deleted brokers: 4, all live brokers: 3,2,1 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)"
"INFO [Controller 1]: Broker failure callback for 4 (kafka.controller.KafkaController)"

What does a broker failure affect? In fact, in the multi-replica scenario, it generally has no impact on the end user.
It only affects those partitions whose leader replica is on the failed broker: a leader election is required, and if a new leader cannot be elected, the partition goes offline.
If only a follower replica fails, the partition's status is unaffected and it can still serve; there is just one less available replica. Note that Kafka does not automatically replace a failed replica: if one goes bad, you simply have one less.
But for a leader replica failure, a new leader must be elected; as discussed earlier, the newly elected leader guarantees offset consistency.

Note: there is actually a prerequisite for this consistency: besides the failed leader, there must be other replicas in the ISR (in-sync replicas). As the name implies, the ISR is the set of replicas that can keep up with the leader.
When a partition is created, it is allocated an AR (assigned replicas); but during operation, some replicas may be unable to keep up with the leader for various reasons, and such replicas are removed from the ISR.
So ISR ⊆ AR.
If there are no other replicas in the ISR, and unclean election is allowed, then a leader can be elected from the AR; but this is certain to lose data and cannot guarantee offset consistency.
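A minimal sketch of this election rule, assuming a simplified model where the first surviving ISR member wins, falling back to the AR only if unclean election is allowed:

```python
# Sketch of partition leader election with and without unclean election
# (simplified; real Kafka's controller logic also tracks liveness and
# controller state). AR is the full assigned replica list; ISR is the
# subset still caught up with the leader.
def elect_leader(ar, isr, failed_leader, allow_unclean):
    live_isr = [r for r in isr if r != failed_leader]
    if live_isr:
        return live_isr[0]      # safe: offsets are consistent in the ISR
    if allow_unclean:
        live_ar = [r for r in ar if r != failed_leader]
        return live_ar[0] if live_ar else None  # may lose committed data
    return None                 # no candidate: partition goes offline

ar, isr = [1, 2, 3], [1]        # followers 2 and 3 fell out of the ISR
print(elect_leader(ar, isr, failed_leader=1, allow_unclean=False))  # None
print(elect_leader(ar, isr, failed_leader=1, allow_unclean=True))   # 2
```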

Broker Startup

The startup here refers to startup during failover; the following log will appear:

controller.log:
"INFO [BrokerChangeListener on Controller 1]: Newly added brokers: 3, deleted brokers: 4, all live brokers: 3,2,1 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)"
"INFO [Controller 1]: New broker startup callback for 3 (kafka.controller.KafkaController)"

The process is not complex either: first set all the replicas on the broker to online, and then trigger the transition of offline or new partitions into the online state.
So broker startup only affects offline partitions and new partitions, making it possible for them to become online.
For partitions that are already online normally, the only effect is one more available replica; and only after it completes catching up is it added to the ISR.

Note: after a broker failover, partition leadership does not automatically switch back right away. The resulting problem is load imbalance between brokers, because all reads and writes must go through the leader.
To solve this problem, there is a setting in the server configuration, auto.leader.rebalance.enable; set it to true.
The controller will then start a scheduler thread that periodically rebalances each broker: if it finds that the imbalance ratio on a broker reaches a certain percentage, it re-elects some of the partition leaders back onto their original brokers.
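A possible server.properties fragment for this follows. The two leader.imbalance.* settings are assumptions based on standard Kafka broker configuration (they control the trigger threshold and check interval); check the defaults for your Kafka version.

```properties
# server.properties: let the controller periodically move partition leaders
# back to their preferred (first-assigned) replica.
auto.leader.rebalance.enable=true
# Assumed related settings (verify names/defaults against your version):
# trigger rebalance when this % of a broker's leaders are non-preferred
leader.imbalance.per.broker.percentage=10
# how often the controller's scheduler thread checks for imbalance
leader.imbalance.check.interval.seconds=300
```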

4 Controller Failover

As explained earlier, one broker server is chosen as the controller. This election depends on a ZooKeeper ephemeral node: whichever broker first creates a node under the "/controller" path becomes the controller.
Conversely, we can also watch this path to determine whether the controller has failed over or changed. When the controller fails over, the following log appears:

controller.log:
"INFO [SessionExpirationListener on 1], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)"

The controller mainly acts as the master to arbitrate partition leaders and to maintain the partition and replica state machines, as well as the corresponding ZK watcher registrations.

The Controller's failover process is as follows:

    • Try to preemptively create the ephemeral node under the "/controller" path;
    • if another broker has already created it successfully, a new controller has been born: just update the current metadata;
    • if the broker creates the node itself, it has become the new controller and starts doing the initialization work;
    • initialization mainly means creating and initializing the partition and replica state machines, and setting watchers on changes to the partitions and brokers paths.
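The race to create "/controller" can be modeled with a toy in-memory ZooKeeper. This is only a sketch of the create-if-absent semantics (no watches, sessions, or real ZK API): whichever broker's create succeeds first is the controller; everyone else just records the winner.

```python
# Toy model of controller election: brokers race to create the ephemeral
# /controller node; the first successful create wins.
class FakeZk:
    """Minimal stand-in for ZooKeeper's create-if-absent semantics."""
    def __init__(self):
        self.nodes = {}
    def try_create(self, path, data):
        if path in self.nodes:
            return False        # another broker already won the race
        self.nodes[path] = data
        return True

def elect_controller(zk, broker_ids):
    winner = None
    for b in broker_ids:        # each broker attempts the preemptive create
        if zk.try_create("/controller", b):
            winner = b          # this broker initializes the state machines
        # losers would just read /controller and update their metadata
    return winner, zk.nodes["/controller"]

zk = FakeZk()
winner, node = elect_controller(zk, [3, 1, 2])
print(winner, node)  # 3 3: the first broker to attempt the create wins
```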

As you can see, a controller failover by itself does not affect normal data reads and writes; but partition leaders cannot be re-elected during it, so if a partition leader fails at the same time, the partition goes offline.
However, a controller's death often accompanies broker deaths, so during a controller failover, partitions often go offline, making data temporarily unavailable.

5 Tips

Kafka provides some tools for conveniently viewing information; refer to Kafka Tools (Ref. 3).

a. Verify that a topic works

The simplest way is to test it with the producer and consumer consoles.

Producer console: the following inserts two messages into the topic test on localhost:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message

Consumer console: you can read back the newly written messages as follows:

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

If the whole process completes without errors, OK, your topic works.

b. See whether a topic is healthy

bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test

This prints out detailed information for the topic test.

Several questions can be answered from this output:

First, how many partitions does the topic have, and what is the replication factor, i.e., how many replicas?
In the figure there are 32 partitions, with two replicas per partition.

Next, which brokers are each partition's replicas assigned to, and who is the partition's leader?
For example, in the figure, partition 0's replicas are assigned to brokers 4 and 1, with the leader replica on broker 1.

Finally, is it healthy?
The following aspects indicate the degree of health:

    • The ISR is empty: the partition is offline and cannot provide service; this case does not appear in our figure.
    • The ISR has members, but ISR < replicas: this situation is invisible to the user, but it shows that some replicas have problems and are, at least temporarily, unable to synchronize with the leader. For example, in the figure, partition 0's ISR contains only 1, indicating that replica 4 is offline.
    • ISR = replicas, but the leader is not the first replica in the replicas list: this shows that the leader has been re-elected, which may lead to broker load imbalance. For example, in the figure, partition 9's leader is 2 instead of 3: although all of its replicas are currently normal, a re-election has taken place.
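The three health rules above can be turned into a small checker. This is a sketch: the field layout (replicas list, ISR list, leader id) is assumed from typical --describe output, and you would still need to parse the tool's text yourself.

```python
# Sketch: classify a partition's health from kafka-topics.sh --describe
# fields, using the three rules above. Field layout is assumed.
def partition_health(replicas, isr, leader):
    if not isr:
        return "offline"                # empty ISR: cannot serve
    if set(isr) < set(replicas):
        return "under-replicated"       # some replica lagging or offline
    if leader != replicas[0]:
        return "leader not preferred"   # re-election happened; imbalance risk
    return "healthy"

# The examples from the figure:
print(partition_health(replicas=[4, 1], isr=[1], leader=1))     # under-replicated
print(partition_health(replicas=[3, 2], isr=[3, 2], leader=2))  # leader not preferred
print(partition_health(replicas=[1, 2], isr=[], leader=-1))     # offline
```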

c. Finally, look at Kafka's logs under kafka/logs

Mainly look at controller.log and server.log, which record the controller's and the broker server's logs respectively.
Then, based on the exception logs shown above for each failure, you can see what the problem is.

Reference

1. https://kafka.apache.org/08/design.html
2. Neha Narkhede, "Hands-free Kafka Replication: A Lesson in Operational Simplicity"
3. Kafka Tools

