Distributed architecture design and high availability mechanism of Kafka


Author: Wang, Josh

I. Basic overview of Kafka

1. What is Kafka?

On its official website, Kafka is defined as "a distributed publish-subscribe messaging system", so strictly speaking Kafka is a message publishing and subscription system. Kafka was originally a distributed message queue built by LinkedIn for log processing; LinkedIn's log volume is large but the reliability requirements are relatively low. Its log data mainly covers user behavior (login, browse, click, share, like) and system run logs (CPU, memory, disk, network, system, and process status).

2. What can Kafka do?

Nowadays, Kafka is mainly used to process active streaming data, such as analyzing user behavior. Page views, for example, can be used to design better ad placement, and user search keywords can be aggregated to analyze current trends (compare the famous "long skirt theory" in economics: if long skirts sell well, the economy is depressed, because girls have no money to buy stockings). There is also business data for which storing it in a database is wasteful and storing it directly on a traditional drive is inefficient; in such cases Kafka's distributed storage can be used as well.

3. Related Concepts in Kafka

· Broker

A Kafka cluster contains one or more servers, each of which is called a broker. A Kafka server is a broker, a cluster consists of multiple brokers, and a single broker can hold multiple topics.

· Topic

Every message published to a Kafka cluster has a category called a topic. A topic can be understood as a queue that stores messages. Physically, messages of different topics are stored separately; logically, the messages of one topic may be saved on one or more brokers, but users only need to specify the topic when producing or consuming data, without caring where the data is actually stored.

· Partition

A partition is a physical concept: Kafka physically divides each topic into one or more partitions, and each partition corresponds to a folder on disk that stores all the messages and index files of that partition. If you create two topics, Topic1 with 13 partitions and Topic2 with 19 partitions, 32 folders are created across the cluster. To achieve scalability, a very large topic can be spread over multiple brokers, but Kafka only guarantees that messages are delivered to consumers in order within a single partition; it does not guarantee ordering across a whole topic (i.e. across multiple partitions).

· Producer

The message producer, the client that publishes messages to the Kafka broker.

· Consumer

The message consumer, the client that reads messages from the Kafka broker.

· Consumer Group (CG)
This is the mechanism Kafka uses to implement both broadcast (a topic message goes to all consumers) and unicast (a topic message goes to exactly one consumer). A topic can be consumed by more than one CG. A topic's messages are copied (conceptually, not physically) to every CG, but within a CG each message is delivered to only one consumer. To implement broadcast, give each consumer its own CG; to implement unicast, put all consumers in the same CG. CGs also allow consumers to be grouped freely without publishing the same messages to multiple topics. Each consumer belongs to a specific consumer group; Kafka allows a group name to be specified for each consumer, and a default group is used if none is given.
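
As a minimal sketch of this idea (using the group.id setting of the modern Java client and java.util.Properties; the group names are placeholders, not from the original article): giving every consumer its own group yields broadcast, while sharing one group yields unicast.

    // Broadcast: every consumer gets its own consumer group, so each of them receives every message of the topic.
    Properties consumerA = new Properties();
    consumerA.put("group.id", "cg-alerts");      // placeholder group names
    Properties consumerB = new Properties();
    consumerB.put("group.id", "cg-analytics");

    // Unicast / load sharing: all consumers share one consumer group, so each message goes to exactly one of them.
    Properties worker1 = new Properties();
    worker1.put("group.id", "cg-workers");
    Properties worker2 = new Properties();
    worker2.put("group.id", "cg-workers");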

4. Kafka Characteristics

1) Data on disk is accessed at O(1) cost, whereas typical message stores keep data on disk in a B-tree with O(log N) access cost.

2) High throughput: even on very ordinary hardware, a single node can handle on the order of hundreds of thousands of messages per second.

3) Explicit distribution: there are always multiple producers, brokers, and consumers, all evenly distributed, and messages can be partitioned across Kafka servers and consumed in parallel by consumer clusters.

4) Support for loading data into Hadoop in parallel.

5) Support for partitioning messages across brokers and for distributed consumption, while guaranteeing message ordering within each partition.

6) Support for both offline and real-time data processing: many existing message queue services provide reliable delivery and assume messages are consumed almost immediately (which does not suit offline use), whereas Kafka lets messages accumulate in a distributed cluster, so it can serve offline and online log processing at the same time.

7) Scale out: Supports online horizontal scaling.

II. The structural design of Kafka

1. The simplest Kafka deployment diagram

If the publisher of a message is called the producer, the subscriber of the message is called the consumer, and the intermediate storage array is called the broker, we obtain a simple message publish-subscribe model:


2. Kafka topology diagram:

As a distributed message publishing and subscribing system, Kafka's topology contains, in addition to multiple producers, brokers, and consumers, a ZooKeeper cluster that manages and coordinates the producers, brokers, and consumers.


As can be seen, a typical Kafka cluster contains a number of producers (web front-end page views, server logs, system CPU and memory metrics, etc.), a number of brokers (Kafka supports horizontal scaling; generally, the more brokers, the higher the cluster throughput), several consumer groups, and one ZooKeeper cluster. Kafka manages the cluster configuration through ZooKeeper, elects leaders, and rebalances when a consumer group changes. Producers publish messages to brokers in push mode, while consumers subscribe to and consume messages from brokers in pull mode.

One detail worth noting: producer-to-broker is a push process, that is, the data is pushed to the broker, whereas consumer-to-broker is a pull process in which the consumer actively pulls data, rather than the broker actively sending data to the consumer.

3. How ZooKeeper, producers, brokers, and consumers work together

For ease of understanding, assume that the Kafka cluster has two producers but only one Kafka broker, one ZooKeeper, and one consumer, deployed as shown in the figure.


1. A Kafka broker is actually a Kafka server; the broker is mainly responsible for storage. When each broker starts, it registers a temporary broker registry on ZooKeeper containing the broker's IP address and port number as well as the topics and partitions it stores.

2. ZooKeeper can be thought of as maintaining a table that records the IP, port, and other configuration information of each node.

3. Producer1, Producer2, and the consumer all have a ZkClient configured; more specifically, the ZooKeeper address must be configured before they run. The reason is simple: the connections between them are all coordinated through ZooKeeper.

4. The Kafka broker and ZooKeeper can be placed on the same machine or on separate machines. ZooKeeper itself can also be clustered, so there is no single point of failure.

5. After each consumer starts, it registers a temporary consumer registry on ZooKeeper containing the consumer's group and the topics it subscribes to. Each consumer group is associated with a temporary owner registry and a persistent offset registry: for each subscribed partition there is an owner registry whose content is the ID of the consumer that owns the partition, and an offset registry whose content is the offset of the last consumed message.
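
For reference, in the 0.8-era layout these registries correspond roughly to ZooKeeper paths like the following (a sketch; the exact names depend on the Kafka version):

    /brokers/ids/[brokerId]                              -- ephemeral broker registry (host, port)
    /consumers/[groupId]/ids/[consumerId]                -- ephemeral consumer registry (subscribed topics)
    /consumers/[groupId]/owners/[topic]/[partitionId]    -- ephemeral owner registry (owning consumer ID)
    /consumers/[groupId]/offsets/[topic]/[partitionId]   -- persistent offset registry (last consumed offset)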

The operation of the whole system can be summarized in the following sequence:

1. Start the ZooKeeper server.

2. Start the Kafka server.

3. When a producer produces data, it first locates the broker through ZooKeeper and then stores the data on that broker.

4. When a consumer wants to consume data, it first locates the corresponding broker through ZooKeeper and then consumes from it.

Producer code Example:

producer = new Producer(...);
message = new Message("Hello Ebay".getBytes());
set = new MessageSet(message);
producer.send("topic1", set);

When publishing a message, the producer constructs a Message and adds it to a MessageSet (Kafka supports batch publication: multiple messages can be added to the set and published in one call). When sending, the client needs to specify the topic to which the messages belong.
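
The snippet above follows the pseudocode style of the early Kafka documentation. For comparison, a rough sketch of the same publish with the current Java producer client might look like this (the broker address and topic name are placeholders, not from the original article):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");   // placeholder broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, String> producer = new KafkaProducer<>(props);
    // send() appends the record to an in-memory batch; the client publishes batches to the topic's partition leader
    producer.send(new ProducerRecord<>("topic1", "Hello Ebay"));
    producer.close();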

Consumer code example:

streams[] = Consumer.createMessageStreams("topic1", 1);
for (message : streams[0]) {
    bytes = message.payload();
    // do something with the bytes
}

When subscribing, the consumer needs to specify the topic and the number of partition streams (each partition corresponds to a logical log stream; for example, a topic may represent a product line, and a partition that product line's log for one day). After subscribing, the client can iterate over the messages; if there are no messages, the client blocks until a new message is published. Consumers can acknowledge received messages cumulatively: acknowledging the message at some offset means that all earlier messages have been received successfully, and the broker updates the offset registry on ZooKeeper.
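
Again for comparison, a rough sketch of the subscribe-and-iterate loop with the current Java consumer client (placeholders as before; note that the old high-level consumer's iterator blocks indefinitely, whereas poll() here returns after a timeout):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");   // placeholder broker address
    props.put("group.id", "cg-1");                    // consumer group, as described above
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("topic1"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
        for (ConsumerRecord<String, String> record : records) {
            byte[] bytes = record.value().getBytes();  // do something with the bytes
        }
        consumer.commitSync();  // cumulative acknowledgement: commits the offsets of everything polled so far
    }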

So how is the processing state of each consumer recorded? In fact, Kafka only saves the offset up to which each consumer has processed data. This has two advantages: the amount of state to save is small, and when a consumer fails and restarts, it simply resumes processing from the most recent offset.

4. Kafka Storage Policy

The Kafka broker is mainly used for storage; each broker can hold the data of a large number of topics.


1. Each partition is stored on the broker as a series of segment files.

2. Each segment stores multiple messages. A message's ID is determined by its logical position in the log, so the storage location of a message can be derived directly from its ID, avoiding an extra ID-to-location mapping.

3. Each partition has an index in memory that records the offset of the first message in each segment.

4. Messages that a producer sends to a topic are distributed evenly across its partitions (either randomly or according to a user-specified partitioning callback). When the broker receives a published message, it appends it to the last segment of the corresponding partition. A segment is flushed to disk when the number of messages on it reaches a configured value or when the messages have been held longer than a configured threshold, and only messages that have been flushed to disk can be subscribed to. Once a segment reaches a certain size, the broker stops writing to it and creates a new segment.
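
The thresholds mentioned above map roughly onto broker configuration; a sketch using current Kafka broker property names (the values are illustrative, not recommendations):

    # roll to a new segment once the active segment reaches this size (bytes)
    log.segment.bytes=1073741824
    # flush the log to disk after this many messages have accumulated ...
    log.flush.interval.messages=10000
    # ... or after messages have been sitting unflushed for this long (ms)
    log.flush.interval.ms=1000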

5. Kafka's delivery guarantee

In Kafka there are two potential sources of inefficiency: too many network requests and too many byte copies. To improve network utilization, Kafka groups messages, and each request sends a set of messages to the consumer. In addition, to reduce byte copies, Kafka uses the sendfile system call; sendfile is far more efficient than the traditional approach of copying file data through user-space sockets.

Kafka supports the following kinds of delivery guarantee:

· At most once: messages may be lost, but are never redelivered.

· At least once: messages are never lost, but may be redelivered.

· Exactly once: each message is delivered once and only once, which is in many cases what users actually want.
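
On the consumer side, the difference between at-most-once and at-least-once largely comes down to when the offset is recorded relative to processing. A minimal sketch in terms of the modern Java consumer (the consumer object is as configured in the earlier sketch, and process() is a hypothetical handler):

    // At-most-once: commit the offset before processing.
    // If processing crashes afterwards, the messages are never redelivered, so they may be lost.
    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(1000));
    consumer.commitSync();
    process(batch);   // process() is a hypothetical application handler

    // At-least-once: process before committing.
    // If the crash happens before commitSync(), the same messages are redelivered after restart.
    ConsumerRecords<String, String> batch2 = consumer.poll(Duration.ofMillis(1000));
    process(batch2);
    consumer.commitSync();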

6. Kafka multi-datacenter data flow topology

In practice, sometimes for security reasons you do not want a single Kafka cluster to span multiple data centers (even if the data centers can communicate with each other), but you still want Kafka to support a multi-datacenter data flow topology. This can be achieved by mirroring, i.e. "synchronizing" between clusters. The feature is very simple: the mirror cluster just runs as a consumer of the source cluster. This means a single cluster can aggregate data from multiple data centers into one location. An example of a multi-datacenter topology that can be used to support batch loads is shown below:



Note that the two clusters in the upper part of the figure have no communication link between them; they may be of different sizes and have different numbers of nodes, and the single cluster in the lower part can mirror any number of source clusters.

III. The high availability mechanism of Kafka

1. Overview:

Kafka started to provide a high availability mechanism with the 0.8 release. Before that, once one or more brokers went down, all the partitions on them could not be consumed during the outage, and producers could no longer write data to those partitions. If a broker could never be recovered, or a disk failed, the data on it was lost regardless of whether the producer used synchronous or asynchronous mode, which lowered the overall availability of the system. With synchronous mode, the producer throws an exception after retrying message.send.max.retries times (3 by default), and the user can choose to stop sending subsequent data or to continue: continuing causes blocking, while stopping loses the data that should have been sent to the broker. With asynchronous mode, the producer also retries message.send.max.retries times (3 by default), then logs the exception and continues sending subsequent data; this leads to data loss that the user can only discover through the logs, and, worse still, the Kafka producer at that time did not provide a callback interface for the asynchronous mode.

The HA mechanism of Kafka is mainly ensured by data replication and leader election. Data replication means that each partition may have one or more replicas. One of the replicas is selected as the leader, and the remaining replicas act as followers; the leader keeps track of the list of replicas it is keeping in sync with, which is called the ISR (In-Sync Replicas).

2. Data Replication

2.1. Kafka's replica allocation algorithm

For better load balancing, Kafka distributes all partitions as evenly as possible across the cluster; a typical deployment has more partitions per topic than brokers. At the same time, to improve fault tolerance, the replicas of the same partition must also be spread across different machines: if all replicas of a partition were on the same broker, then when that broker went down all of them would stop working and the HA effect would not be achieved. Also, if a broker goes down, its load should be spread evenly over all the remaining live brokers.

Kafka allocates replicas as follows (assuming there are n brokers in total; a code sketch of this assignment follows the list):

    1. Sort all brokers and the partitions to be assigned
    2. Assign the i-th partition to the (i mod n)-th broker
    3. Assign the j-th replica of the i-th partition to the ((i + j) mod n)-th broker
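
A minimal sketch of this assignment (illustrative only; the real implementation also randomizes the starting broker per topic to spread load):

    // Assign replica j of partition i to broker (i + j) mod n; replica 0 is the preferred replica / initial leader.
    static int[][] assignReplicas(int numPartitions, int replicationFactor, int[] sortedBrokerIds) {
        int n = sortedBrokerIds.length;
        int[][] assignment = new int[numPartitions][replicationFactor];
        for (int i = 0; i < numPartitions; i++) {
            for (int j = 0; j < replicationFactor; j++) {
                assignment[i][j] = sortedBrokerIds[(i + j) % n];
            }
        }
        return assignment;
    }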

2.2. How Kafka propagates messages and handles ACKs

When a producer publishes a message to a partition, it first finds the leader of that partition through ZooKeeper. Then, regardless of the topic's replication factor (i.e. how many replicas the partition has), the producer sends the message only to the partition's leader. The leader writes the message to its local log, and each follower pulls data from the leader, so the order of data stored on the followers is consistent with the leader. After a follower receives a message and writes it to its log, it sends an ACK to the leader. Once the leader has received ACKs from all replicas in the ISR, the message is considered committed; the leader then advances the HW (high watermark: the offset of the latest committed message) and sends an ACK to the producer.

To improve performance, each follower sends an ACK to the leader as soon as it receives the data, rather than waiting until the data is written to its log. Therefore, for a committed message, Kafka can only guarantee that it is stored in the memory of multiple replicas, not that it has been persisted to disk, so it cannot fully guarantee that the message will still be consumable after an exception. But considering how rare that scenario is, this approach is a good balance between performance and data persistence. Consumers also read messages from the leader, and only committed messages (messages with an offset not above the HW) are exposed to consumers.

2.3. The ISR replication mechanism between partition followers and the leader

Like most distributed systems, Kafka needs a clear definition of whether a broker is "alive" in order to handle failures. For Kafka, being alive has two conditions: the broker must maintain its session with ZooKeeper (via ZooKeeper's heartbeat mechanism), and a follower must be able to replicate the leader's messages in a timely manner, i.e. it cannot fall "too far behind". The leader keeps track of the list of replicas it is keeping in sync with, the ISR mentioned above, and if a follower goes down or falls too far behind, the leader removes it from the ISR. Here "too far behind" means that the number of messages the follower has replicated lags the leader by more than a preset value (configurable in $KAFKA_HOME/config/server.properties as replica.lag.max.messages, default 4000), or that the follower has not sent a fetch request to the leader for longer than a certain time (configurable as replica.lag.time.max.ms, default 10000).

Kafka's replication mechanism is neither fully synchronous nor simply asynchronous. Fully synchronous replication requires all working followers to have copied a message before it is considered committed, which greatly limits throughput. With asynchronous replication, followers copy data from the leader asynchronously and data is considered committed as soon as the leader writes it to its log; in that case the followers may lag behind the leader, and if the leader suddenly goes down, data is lost. Kafka's ISR approach strikes a good balance between not losing data and keeping throughput high. Followers can replicate data from the leader in batches, which greatly improves replication performance (batched disk writes) and greatly reduces the gap between followers and the leader.

It should be noted that Kafka only handles fail/recover; it does not address "Byzantine" failures. A message is considered committed only when all followers in the ISR have copied it from the leader (they do not need to have written it to their own logs yet, per the optimization above). This avoids the situation where data written to the leader is lost because the leader goes down before any follower has copied it (in which case consumers could never consume it). For the producer, it can choose whether to wait for a message to be committed; this is controlled by request.required.acks, which ensures that as long as the ISR still contains one or more followers, a committed message is not lost.
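
As a sketch of the corresponding producer setting (request.required.acks is the 0.8-era name; acks is its equivalent in the current Java producer; the values are illustrative):

    # 0.8-era producer: 0 = don't wait for any ack, 1 = wait for the leader, -1 = wait for all replicas in the ISR
    request.required.acks=-1

    # current Java producer equivalent, often combined with a topic/broker-level lower bound on the ISR size
    acks=all
    min.insync.replicas=2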

2.4. How followers fetch data from the leader

A follower fetches messages by sending a FetchRequest to the leader. The FetchRequest structure is as follows:


As can be seen from the structure of FetchRequest, each fetch request specifies a maximum wait time and a minimum number of bytes to fetch, as well as a map from TopicAndPartition to PartitionFetchInfo. In fact, both followers fetching data from the leader and consumers fetching data from brokers use FetchRequest, which is why the structure contains a clientId field whose default value is ConsumerConfig.DefaultClientId. After the leader receives a fetch request, Kafka handles it through KafkaApis.handleFetchRequest; the response process is as follows (a small sketch of the early-return check in step 4 follows the list):

    1. ReplicaManager reads the requested data into dataRead.
    2. If the request comes from a follower, update that follower's LEO (log end offset) and the partition's high watermark (HW).
    3. Based on dataRead, compute the number of readable bytes into bytesReadable.
    4. If any one of the following four conditions is met, return the corresponding data immediately:
      • the fetch request does not want to wait, i.e. fetchRequest.maxWait <= 0
      • the fetch request does not require any messages, i.e. fetchRequest.numPartitions <= 0 (requestInfo is empty)
      • there is already enough data to return, i.e. bytesReadable >= fetchRequest.minBytes
      • an exception occurred while reading the data
    5. If none of the four conditions is met, the FetchRequest does not return immediately; instead it is wrapped as a DelayedFetch. The broker then checks whether the DelayedFetch can already be satisfied: if so, it is returned; otherwise the request is added to the watch list.
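
A minimal sketch of the early-return check in step 4 (a paraphrase of the conditions above, not the actual broker code):

    // Decide whether the broker should answer a fetch request right away rather than parking it as a DelayedFetch.
    static boolean respondImmediately(long maxWaitMs, int numPartitions,
                                      long bytesReadable, long minBytes, boolean readError) {
        return maxWaitMs <= 0             // the request does not want to wait
            || numPartitions <= 0         // nothing was actually requested (requestInfo is empty)
            || bytesReadable >= minBytes  // enough data has already accumulated
            || readError;                 // an error occurred while reading the data
    }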

The leader returns messages to the follower in the form of a FetchResponse. The FetchResponse structure is as follows:



2.5. Redistribution of partitions

After the management tool or a client issues a partition reassignment request, the information is written to the /admin/reassign_partitions path on ZooKeeper. This triggers ReassignedPartitionsIsrChangeListener, which completes the partition reassignment by executing the callback KafkaController.onPartitionReassignment. The main process is to update the AR (Currently Assigned Replicas) in ZooKeeper to OAR (Original list of Assigned Replicas for the partition) + RAR (Reassigned Replicas).

2.6. Replication Tools

1. Topic Tool

Located at $KAFKA_HOME/bin/kafka-topics.sh, this tool can be used to create, delete, and modify a topic, to view its configuration information, or to list all topics.

2. Replica Verification Tool

Located at $KAFKA_HOME/bin/kafka-replica-verification.sh, this tool verifies that all replicas of every partition under one or more specified topics are in sync. The topic-white-list parameter specifies which topics to validate and supports regular expressions.

3. Kafka Reassign Partitions Tool

This tool has a similar goal to the Preferred Replica Leader Election tool described later: to facilitate load balancing of a Kafka cluster. The difference is that the Preferred Replica Leader Election tool can only adjust which replica within a partition's existing AR becomes the leader, so that leaders are distributed evenly, whereas this tool can also adjust the AR of a partition. Followers need to fetch data from the leader to stay in sync, so balancing only the leader distribution is not enough to balance the load of the entire cluster. In addition, in a production environment it may be necessary to expand the Kafka cluster as load grows. Adding brokers to a Kafka cluster is simple and convenient, but existing topics do not automatically migrate their partitions to the newly added brokers; this tool can be used to do that.

In some scenarios the actual load may turn out to be much smaller than initially expected, and this tool can be used to concentrate the partitions of the whole cluster onto a few machines and then stop the unneeded brokers to save resources. Note that the tool can change not only the placement of a partition's AR but also its size, that is, change the topic's replication factor. The tool has three modes of use: generate mode, which, given the topics to be reassigned, automatically generates a reassign plan (without executing it); execute mode, which reassigns partitions according to a specified reassign plan; and verify mode, which verifies the reassignment.
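
For reference, a reassign plan is a small JSON document; a sketch of its shape (topic, partition, and broker IDs are placeholders):

    {
      "version": 1,
      "partitions": [
        { "topic": "topic1", "partition": 0, "replicas": [1, 2] },
        { "topic": "topic1", "partition": 1, "replicas": [2, 3] }
      ]
    }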

The data flow for Kafka replication is as follows:



3. Leader Election

3.1. Overview

After replication is introduced, the same partition may have multiple replicas, and one of them needs to be elected as the leader. Without a leader, all replicas could read and write data at the same time, and it would be necessary to keep the replicas mutually synchronized (N x N paths); the consistency and ordering of the data would be very hard to guarantee, the complexity of the replication implementation would rise greatly, and so would the probability of anomalies. With a leader, only the leader handles data reads and writes (i.e. the leader interacts with producers and consumers), while followers fetch data from the leader sequentially (N paths), making the system simpler and more efficient.

3.2. How Kafka selects the leader using ZooKeeper and the ISR

In Kafka, a simple and intuitive leader election scheme would be for all followers to set a watch on ZooKeeper: once the leader goes down, its corresponding ephemeral znode is automatically deleted, at which point all followers try to create the node, and the one that succeeds (ZooKeeper guarantees that only one can) becomes the new leader while the other replicas remain followers. However, since a follower may lag far behind the leader or may have crashed, it is important to ensure that the "newest" follower is selected as the new leader. A basic principle is that if the leader is gone, the new leader must hold all the messages the original leader had committed. This requires a tradeoff: if the leader waits for more followers to confirm before marking a message as committed, then more followers are eligible to become the new leader after it goes down, but throughput decreases.

A very common leader election method is "majority vote" (the minority yields to the majority), but Kafka does not use it. In this mode, if there are 2n+1 replicas (leader plus followers), a message must be copied to n+1 replicas before it is committed, and as long as no more than n replicas fail, a new leader can be elected correctly, because at least one of the remaining n+1 replicas contains all committed messages. This approach has a big advantage: the system's latency depends only on the fastest brokers, not the slowest one. But majority vote also has drawbacks: to keep leader election working, the number of follower failures it can tolerate is relatively small. To tolerate 2 failed followers you need at least 5 replicas; to tolerate 3, at least 7. In other words, to guarantee a high degree of fault tolerance in production, there must be a large number of replicas, and a large number of replicas causes a sharp drop in performance under large data volumes. This is why the algorithm is mostly used in shared cluster-configuration systems such as ZooKeeper and rarely in systems that need to store large amounts of data. For example, HDFS's HA feature is based on a majority-vote journal, but its data storage does not use this approach.

Kafka 0.8 and later adds the concept of a controller: one of the brokers is elected as the controller, and all partition leader elections are decided by it. The controller notifies the brokers affected by a leadership change directly via RPC (which is more efficient than a ZooKeeper queue). The controller is also responsible for adding and deleting topics and for replica reassignment. As mentioned earlier, Kafka dynamically maintains an ISR (In-Sync Replicas) in ZooKeeper; every replica in the ISR has caught up with the leader, and only ISR members can be elected leader. In this mode, for n+1 replicas, a partition can tolerate the failure of n replicas without losing committed messages. In most usage scenarios this is a big advantage: to tolerate n replica failures, majority vote and the ISR have to wait for the same number of replicas before commit, but the total number of replicas the ISR needs is almost half of what majority vote requires. Although majority vote has the advantage of never having to wait for the slowest broker, Kafka's view is that this can be mitigated by letting the producer choose whether to block on commit, and that the replicas and disks saved make the ISR mode worthwhile.

3.3. Preferred Replica Leader Election Tool

With the replication mechanism, each partition may have multiple replicas. The replica list of a partition is called its AR (Assigned Replicas), and the first replica in the AR is the "preferred replica". When a new topic is created or partitions are added to an existing topic, Kafka ensures that the preferred replicas are evenly distributed across all brokers in the cluster. Ideally, the preferred replica is elected leader. These two points together ensure that all partition leaders are spread evenly across the cluster, which is very important, because all reads and writes go through the leader; if leaders are too concentrated, the cluster load becomes unbalanced. However, as the cluster runs, this balance can be broken by broker outages, and this tool is used to help restore the balance of leader assignments.

In fact, after a replica recovers from a failure it takes the follower role by default, unless all replicas of that partition were down and the current broker holds the first replica in the partition's AR to come back. Therefore, after a partition's leader (the preferred replica) goes down and recovers, it will likely no longer be the leader of that partition, although it is still the preferred replica.

Besides running this tool manually to rebalance leaders, Kafka can balance leader assignments automatically: with auto.leader.rebalance.enable=true, the controller periodically checks the balance of leader assignments, and if the imbalance exceeds a threshold it automatically tries to move each partition's leadership back to its preferred replica. The check period is specified by leader.imbalance.check.interval.seconds, and the imbalance threshold by leader.imbalance.per.broker.percentage.
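
As a sketch of the relevant broker settings (the values shown are illustrative):

    # let the controller periodically move leadership back to each partition's preferred replica
    auto.leader.rebalance.enable=true
    # how often to check (seconds), and how much leader imbalance per broker to tolerate (percent)
    leader.imbalance.check.interval.seconds=300
    leader.imbalance.per.broker.percentage=10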

Summary: from the above analysis we can see that Kafka's main design goal is to support both offline and real-time processing. Thanks to this, a real-time stream processing system such as Storm can consume messages in real time while a batch system such as Hadoop processes them offline, and the same data can at the same time be backed up in real time to another data center; you only need to make sure the consumers performing these three operations belong to different consumer groups, because consumers in different consumer groups each receive the messages that producers publish to the Kafka brokers. Meanwhile, Kafka's HA mechanism, based on ISR synchronization, strikes a good balance between data integrity and throughput, giving users a distributed publish-subscribe messaging system with high throughput and strong data integrity.


