Kafka Data Reliability: An In-Depth Interpretation

Kafka was originally a distributed messaging system developed at LinkedIn and later became part of Apache. It is written in Scala and is widely used for its horizontal scalability and high throughput. At present, more and more open-source distributed processing systems such as Cloudera, Apache Storm, and Spark support integration with Kafka.

1 Overview

Kafka differs from traditional messaging systems in the following ways:

It is designed as a distributed system, which makes it easy to scale out.

It provides high throughput for both publishing and subscribing.

It supports multiple subscribers and automatically balances consumers when one fails.

It persists messages to disk, so it can be used for batch consumption (such as ETL) as well as real-time applications.

By virtue of these advantages, Kafka has been adopted by more and more Internet enterprises as one of their internal core message engines. For commercial-grade message middleware like Kafka, the importance of message reliability is self-evident. How can messages be transmitted accurately? How can they be stored accurately? How can they be consumed correctly? These are the issues that must be considered. This article starts from Kafka's architecture: it first covers the basic principles of Kafka, then analyzes its reliability through the Kafka storage mechanism, replication principle, synchronization principle, and reliability and durability guarantees, and finally reinforces this understanding of Kafka's high reliability through benchmarks.

2 Kafka System Architecture

As shown in the illustration above, a typical Kafka architecture includes several producers (which can be server logs, business data, page views generated by the front end, and so on), several brokers (Kafka supports horizontal expansion; in general, the more brokers, the higher the cluster throughput), several consumers (consumer groups), and a ZooKeeper cluster. Kafka uses ZooKeeper to manage cluster configuration, elect the leader, and rebalance when a consumer group changes. Producers publish messages to brokers in push mode; consumers subscribe to and consume messages from brokers in pull mode.

Terminology:

Broker: A message middleware processing node. A Kafka node is a broker; one or more brokers form a Kafka cluster.
Topic: Kafka classifies messages by topic. Each message published to a Kafka cluster must specify a topic.
Producer: A message producer; a client that sends messages to a broker.
Consumer: A message consumer; a client that reads messages from a broker.
Consumer group: Each consumer belongs to a specific consumer group. A message can be sent to several different consumer groups, but within a single consumer group only one consumer can consume the message.
Partition: A physical concept. A topic can be divided into multiple partitions, and each partition is internally ordered.
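
To make these roles concrete, here is a minimal sketch using the modern Java client; the broker address, topic name, and consumer group id are illustrative assumptions, not from the original article:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Producer pushes a keyed message to the brokers; a consumer in a group
// pulls it back. Placeholder names throughout.
public class ProducerConsumerDemo {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("topic_vms_test", "key-1", "hello"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group"); // within a group, one consumer gets each message
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("topic_vms_test"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value());
            }
        }
    }
}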

2.1 Topic & Partition


A topic can be thought of as a class of messages. Each topic is divided into multiple partitions, and each partition at the storage level is an append log file. Any message published to a partition is appended to the end of its log file. The position of each message in the file is called its offset, a long integer that uniquely identifies a message within the partition. Because every message is appended to the partition, writes are sequential disk writes, which is highly efficient (it has been verified that sequential disk writes are more efficient than random memory writes, an important guarantee of Kafka's high throughput).
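
As an illustration of the append-log idea (this is not Kafka's actual on-disk format), the following sketch appends a record to a file and reports the physical position at which it lands; the file path is a placeholder:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal append-only log sketch: records are only ever appended, and the
// position at which a record starts plays the role of its physical offset.
public class AppendLogDemo {
    public static void main(String[] args) throws IOException {
        Path logFile = Path.of("/tmp/demo-partition-0.log"); // illustrative path
        try (FileChannel ch = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            long position = ch.size(); // where the next record will land
            ch.write(ByteBuffer.wrap("hello kafka\n".getBytes(StandardCharsets.UTF_8)));
            System.out.println("appended at physical offset " + position);
        }
    }
}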


When a message is sent to a broker, a partition rule chooses which partition it is stored in. If the partition rule is set reasonably, all messages are distributed evenly across the different partitions, thus achieving horizontal scaling. (If a topic corresponded to a single file, the machine I/O of that file would become the performance bottleneck of the topic; partitioning solves this problem.) You can specify the number of partitions in $KAFKA_HOME/config/server.properties when you create a topic (see below), and you can of course also modify the number of partitions after the topic is created.

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=3

When sending a message, you can specify a key for the message; the producer then determines which partition the message is sent to based on this key and the partitioning mechanism. The partitioning mechanism can be specified by setting the producer's partitioner.class parameter, whose class must implement the kafka.producer.Partitioner interface.
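
A sketch of such a partitioner follows. The article references the older Scala-era kafka.producer.Partitioner; this sketch instead targets the modern Java client's org.apache.kafka.clients.producer.Partitioner interface, and the class name is illustrative:

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Minimal key-hash partitioner sketch: spreads keyed messages evenly across
// partitions by hashing the key bytes.
public class KeyHashPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // no key: this sketch falls back to partition 0
        }
        // Mask the sign bit so the modulo result is always non-negative.
        return (java.util.Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

It would be registered on the producer through the partitioner.class configuration property.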

For more details on topics and partitions, refer to the "Kafka File Storage Mechanism" section below.

3 High-Reliability Storage Analysis

Kafka's high reliability derives from its robust replication strategy. By adjusting the replication-related parameters, Kafka can trade off between performance and reliability. Kafka offers partition-level replication starting with version 0.8.x, and the default number of replicas can be configured via default.replication.factor in $KAFKA_HOME/config/server.properties.

We start from Kafka's file storage mechanism, examining storage details from the bottom of Kafka to gain a micro-level understanding of its storage. Then the macro-level concepts are explained through Kafka's replication principle and synchronization method. Finally, the ISR, HW, leader election, and data reliability and durability guarantees are each examined to enrich the understanding of Kafka's reliability.

3.1 Kafka File Storage Mechanism

Kafka messages are organized by topic: producers send messages to Kafka brokers through topics, and consumers read data through topics. At the physical level, however, a topic is grouped by partition; a topic can be divided into several partitions. So how are topics and partitions stored? A partition is further subdivided into segments; a partition physically consists of multiple segments. So what are these segments? Here we examine them one by one.

To illustrate, suppose there is only one Kafka cluster, and this cluster has only one Kafka broker, that is, one physical machine. On this Kafka broker, configure log.dirs=/tmp/kafka-logs in $KAFKA_HOME/config/server.properties to set the Kafka message file storage directory, and create a topic named topic_vms_test with 4 partitions ($KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 4 --topic topic_vms_test --replication-factor 1; with a single broker, the replication factor cannot exceed 1). We can now see that 4 directories have been generated in the /tmp/kafka-logs directory:

drwxr-xr-x 2 root root 4096 Apr 16:10 topic_vms_test-0
drwxr-xr-x 2 root root 4096 Apr 16:10 topic_vms_test-1
drwxr-xr-x 2 root root 4096 Apr 16:10 topic_vms_test-2
drwxr-xr-x 2 root root 4096 Apr 16:10 topic_vms_test-3

In Kafka file storage, there are multiple partitions under the same topic, and each partition is a directory. The partition naming rule is: topic name + ordered sequence number. The first sequence number starts from 0, and the largest is the number of partitions minus 1. A partition is an actual physical concept, while a topic is a logical concept.

As mentioned above, a partition is further subdivided into segments. What is a segment for? If we took the partition as the smallest storage unit, then as Kafka producers continually send messages, the partition file would inevitably grow without bound, which would severely complicate the maintenance of message files and the cleanup of already-consumed messages. So the partition is subdivided with the segment as the unit. Each partition (directory) is equivalent to one huge file distributed evenly across multiple segment data files of roughly equal size (the number of messages in each segment file is not necessarily equal). This design makes it easy to delete old segments, which facilitates cleaning up already-consumed messages and improves disk utilization. Each partition only needs to support sequential reads and writes, and a segment's file lifecycle is determined by server-side configuration parameters (log.segment.bytes, log.roll.{ms,hours}, and several others).
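
The roll decision these parameters imply can be sketched as follows (illustrative pseudologic, not Kafka's actual code; the thresholds in main are example values):

// A new segment is started when the active one grows too large
// (log.segment.bytes) or too old (log.roll.ms / log.roll.hours).
public class SegmentRollRule {
    static boolean shouldRoll(long segmentSizeBytes, long segmentAgeMs,
                              long logSegmentBytes, long logRollMs) {
        return segmentSizeBytes >= logSegmentBytes || segmentAgeMs >= logRollMs;
    }

    public static void main(String[] args) {
        // 900 MiB segment, 10 s old, vs. a 1 GiB size limit and 7-day age limit.
        System.out.println(shouldRoll(900L << 20, 10_000L,
                1L << 30, 7L * 24 * 3600 * 1000)); // prints false
    }
}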

A segment consists of two parts, an .index file and a .log file, which are the segment index file and the segment data file, respectively. The naming rule for these two files is: the first segment of a partition starts from 0, and each subsequent segment file is named after the offset of the last message in the previous segment file. The value is 64 bits, rendered as 20 digit characters, left-padded with zeros. For example:

00000000000000000000.index
00000000000000000000.log
00000000000000170410.index
00000000000000170410.log
00000000000000239430.index
00000000000000239430.log
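
The 20-digit, zero-padded names above are simply the segment's numeric offset rendered at a fixed width; for example (illustrative helper, not Kafka code):

public class SegmentNames {
    public static void main(String[] args) {
        long offset = 170410L;
        // Zero-pad the offset to 20 digits, as in the listing above.
        System.out.println(String.format("%020d.index", offset)); // 00000000000000170410.index
        System.out.println(String.format("%020d.log", offset));   // 00000000000000170410.log
    }
}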

Taking the segment files above as an example, the correspondence between the .index file and the .log file of segment 00000000000000170410 is shown in the following figure:

As shown above, the .index file stores a large amount of metadata, while the .log file stores the messages themselves; each metadata entry in the index file points to the physical offset of the corresponding message in the data file. For example, the metadata [3, 348] in the .index file represents the 3rd message in the .log data file (that is, the 170410+3 = 170413th message in the whole partition), and the physical offset of that message within the data file is 348.

So how do we find a message in a partition by offset? Using the illustration above as an example, suppose we want to read the message at offset=170418. First we locate the segment file: 00000000000000000000.index is the first file; the second file is 00000000000000170410.index (whose starting offset is 170410+1 = 170411); the third file is 00000000000000239430.index (whose starting offset is 239430+1 = 239431). So offset=170418 falls into the second file. Subsequent files are named and arranged by their actual offsets in the same way, so a binary search can quickly locate the specific file. Second, the entry [8, 1325] read from 00000000000000170410.index leads to position 1325 in 00000000000000170410.log.
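
The binary search over segment files can be sketched as follows, using the offsets from the example; the class and method names are illustrative, not Kafka's internals, and exact matches on a file name are a boundary case glossed over here:

import java.util.Arrays;

// Segment files are named by offset, so a binary search over the sorted
// numeric file names finds the segment holding a target offset.
public class SegmentLookup {

    static final long[] SEGMENT_NAMES = {0L, 170410L, 239430L};

    static int findSegment(long[] names, long targetOffset) {
        int pos = Arrays.binarySearch(names, targetOffset);
        return pos >= 0 ? pos : -pos - 2; // floor: last name <= target
    }

    public static void main(String[] args) {
        // offset 170418 falls into the second segment, 00000000000000170410.*
        System.out.println(findSegment(SEGMENT_NAMES, 170418L)); // prints 1
    }
}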

Having located offset=170418 and started reading from position 1325 of 00000000000000170410.log, how do we know where this message ends and the next one begins? This relies on the physical structure of the message. A message has a fixed physical structure, including offset (8 bytes), message body size (4 bytes), CRC32 (4 bytes), magic (1 byte), attributes (1 byte), key length (4 bytes), key (K bytes), and payload (N bytes), which together determine the size of a message, that is, where the current read ends.

3.2 Replication Principle and Synchronization Mode

Each topic partition in Kafka has its own log. Although a partition can be subdivided into several segment files, for the upper-layer application the partition can be viewed as the smallest storage unit (a "mega" file spanning multiple segment files), consisting of ordered, immutable messages that are continuously appended to the partition.

There are two terms in the figure above: HW and LEO. LEO is short for LogEndOffset, the position of the last message in each partition's log. HW is short for HighWatermark, the position in the partition up to which a consumer can see messages. HW involves the concept of multiple replicas; it is mentioned here in passing and detailed in the next section.

To improve message reliability, Kafka keeps N replicas for each topic partition, where N (greater than or equal to 1) is the topic's replication factor. Kafka achieves automatic failover through this multi-replica mechanism: when a broker in the Kafka cluster fails, service availability is still guaranteed. During replication, Kafka ensures that the partition log is written in order across nodes. Of the N replicas, one replica is the leader and the others are followers; the leader handles all read and write requests for the partition, while followers passively and periodically replicate the leader's data.

As shown in the following illustration, there are 4 brokers in the Kafka cluster, one topic has 3 partitions, and the replication factor, i.e., the number of replicas, is 3:
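
A topic shaped like the one in the figure (3 partitions, replication factor 3) could be created with the modern Java Admin API roughly as follows; the topic name and broker address are placeholders:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

// Create a topic with 3 partitions and replication factor 3.
public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("my-replicated-topic", 3, (short) 3);
            admin.createTopics(List.of(topic)).all().get(); // block until created
        }
    }
}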

Kafka provides a data replication algorithm to ensure that if the leader fails or hangs, a new leader is elected and the messages acknowledged to clients are still successfully written. Kafka ensures that the new leader is elected from the list of in-sync replicas, that is, from followers that have caught up with the leader's data. The leader is responsible for maintaining and tracking the lag state of all followers in the ISR (short for In-Sync Replicas, the replica synchronization queue; see the next section). When a producer sends a message to the broker, the leader writes the message and replicates it to all followers. A message is committed only after it has been successfully replicated to all in-sync replicas. Message replication latency is bounded by the slowest follower, so it is important to detect slow replicas quickly; if a follower falls too far behind or fails, the leader removes it from the ISR.

3.3 ISR

In the last section we mentioned the ISR (In-Sync Replicas), the replica synchronization queue. The number of replicas has some effect on Kafka's throughput, but it greatly enhances availability. By default, Kafka's replica count is 1, that is, each partition has a single leader. To ensure message reliability, in practice this value (specified by the broker parameter default.replication.factor) is usually set greater than 1, for example 3. All replicas are collectively referred to as Assigned Replicas, or AR.

The ISR is a subset of AR, and the leader maintains the ISR list. Followers synchronizing data from the leader have some lag, measured in two dimensions: delay time (replica.lag.time.max.ms) and number of lagged messages (replica.lag.max.messages); the latest version, 0.10.x, supports only the replica.lag.time.max.ms dimension. Any follower exceeding the threshold is removed from the ISR and placed in the OSR (Out-of-Sync Replicas) list, and newly added followers are also placed first in the OSR. AR = ISR + OSR.

The replica.lag.max.messages parameter was removed as of Kafka 0.10.x, leaving only replica.lag.time.max.ms as the ISR replica-management parameter. Why? replica.lag.max.messages means that if the number of messages by which a replica currently lags behind the leader exceeds this parameter's value, the leader removes that follower from the ISR. Suppose replica.lag.max.messages=4. If the producer sends fewer than 4 messages to the broker at a time, then after the leader receives them and the follower replicas begin pulling them, no follower lags the leader by more than 4 messages, so no follower is moved out of the ISR. In that case the replica.lag.max.messages setting appears reasonable.

But suppose the producer produces an instantaneous peak of traffic, sending more than 4 messages (that is, more than replica.lag.max.messages) at once. The followers will then be considered out of sync with the leader and kicked out of the ISR, even though in practice those followers are alive and have no performance problems. After catching up with the leader, they rejoin the ISR. The result is that they repeatedly leave and re-enter the ISR, which adds unnecessary performance loss. Moreover, this parameter is global to the broker: too large a setting delays the removal of genuinely "lagging" followers, while too small a setting causes followers to churn in and out of the ISR. Since no universally suitable value of replica.lag.max.messages can be given, the newer versions of Kafka removed the parameter.

Note: the ISR includes both the leader and the followers.

The section above also touched on another concept, the HW. HW, commonly known as the high watermark (HighWatermark), is the smallest LEO among the replicas in a partition's ISR, and consumers can consume only up to the HW position. Each replica has its own HW, and the leader and followers are each responsible for updating their own HW state. A message newly written to the leader cannot be consumed immediately: the leader waits until the message has been replicated by all replicas in the ISR before updating the HW, and only then can the message be consumed. This ensures that if the leader's broker fails, the message can still be obtained from the newly elected leader. Fetch requests from internal brokers (replicas) are not bounded by the HW.
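
The rule "HW is the smallest LEO in the ISR" can be expressed as a one-liner; this is an illustration of the definition, not Kafka's implementation:

import java.util.Collection;

// The partition HW is the minimum LEO across the ISR, so consumers never see
// a message that some in-sync replica has not yet replicated.
public class HighWatermark {
    static long highWatermark(Collection<Long> isrLogEndOffsets) {
        // The shortest plank of the barrel bounds what is visible to consumers.
        return isrLogEndOffsets.stream().mapToLong(Long::longValue).min().orElse(0L);
    }

    public static void main(String[] args) {
        // Leader LEO=5, followers at 4 and 3: consumers can read up to HW=3.
        System.out.println(highWatermark(java.util.List.of(5L, 4L, 3L)));
    }
}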

The following figure describes in detail the flow of the ISR, HW, and LEO as a producer produces messages to the broker:

Thus, Kafka's replication mechanism is neither fully synchronous replication nor simple asynchronous replication. Fully synchronous replication requires all working followers to have replicated a message before it is committed, which greatly limits throughput. With asynchronous replication, followers copy data from the leader asynchronously, and a message is considered committed as long as the leader has written it to its log; in that case, if the leader crashes suddenly while the followers are still behind, data is lost. Kafka's use of the ISR strikes a good balance between not losing data and maintaining throughput.

The management of Kafka's ISR is ultimately reflected in a ZooKeeper node at the path /brokers/topics/[topic]/partitions/[partition]/state. Currently, this ZooKeeper node is maintained in two places:

Maintained by the controller: one broker in the Kafka cluster is elected as the controller, which is primarily responsible for partition management and replica state management, and also performs administrative tasks such as partition reassignment. Under certain specific conditions, the controller's LeaderSelector elects a new leader, writes the new leader, ISR, leader_epoch, and controller_epoch to the relevant ZooKeeper node, and also sends a LeaderAndIsrRequest to notify all replicas.

Maintained by the leader: the leader has a separate thread that periodically checks whether any follower in the ISR has fallen out of the ISR; if it detects an ISR change, it writes the new ISR information to the relevant ZooKeeper node.

3.4 Data Reliability and Durability Guarantees

When a producer sends data to the leader, the level of data reliability can be set through the request.required.acks parameter:

1 (default): the producer sends the next message after the leader in the ISR has successfully received the data and confirmed it. If the leader goes down, data may be lost.

0: the producer does not wait for any confirmation from the broker and continues sending the next batch of messages. In this case data transfer efficiency is the highest, but reliability is the lowest.

-1: the producer waits until all followers in the ISR have confirmed receipt of the data before completing a send; this gives the highest reliability. Even so, it does not guarantee that data will never be lost, for example when the ISR contains only the leader (as noted in the ISR section above, ISR membership can shrink under some circumstances, down to only the leader); this degenerates into the acks=1 case.

To maximize data reliability, set request.required.acks=-1 together with the min.insync.replicas parameter (which can be set at the broker or topic level). min.insync.replicas specifies the minimum number of replicas in the ISR; its default value is 1, and it takes effect only when request.required.acks is set to -1. If the number of replicas in the ISR falls below the min.insync.replicas configuration, the client receives an exception: org.apache.kafka.common.errors.NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.
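
A producer configured along these lines with the modern Java client might look like the sketch below; the broker address and topic name are placeholders. In this client the parameter is spelled acks (the values "-1" and "all" are equivalent), while min.insync.replicas is set on the broker or per topic:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Producer that waits for the whole ISR before a send is considered complete.
public class ReliableProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("acks", "all"); // wait for all ISR members, as with acks=-1
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("topic_vms_test", "key", "value"),
                    (metadata, exception) -> {
                        // If the ISR has fewer members than min.insync.replicas,
                        // the send fails with NotEnoughReplicasException.
                        if (exception != null) exception.printStackTrace();
                    });
        }
    }
}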

Next, the two scenarios acks=1 and acks=-1 are analyzed in detail:

1. request.required.acks=1

The producer sends data to the leader; the leader writes its local log successfully and returns success to the client. If, at this moment, the replicas in the ISR have not yet pulled the message and the leader goes down, the message that was sent is lost.

2. request.required.acks=-1

In synchronous send mode (Kafka's default, i.e., producer.type=sync), with replication.factor>=2 and min.insync.replicas>=2, data will not be lost.

There are two typical situations. In the acks=-1 case (unless otherwise specified, acks below stands for the parameter request.required.acks), if the leader goes down after the data has been sent to it and all the followers in the ISR have completed data synchronization, a new leader is elected and the data is not lost.

Also in the acks=-1 case, if the leader goes down after the data has been sent to it but only part of the ISR has synchronized the data, then, for example, either follower1 or follower2 may become the new leader. The producer receives an exception in return and resends the data, so the data may be duplicated.

Of course, if at the moment the leader crashes follower2 has not synchronized any of the data in question and follower2 is then elected as the new leader, the message will not be duplicated.

Note: Kafka handles only fail/recover problems; it does not address Byzantine problems.

3.5 Further Discussion of HW

Consider the other situation in the figure above (acks=-1, partial ISR synchronization): the leader goes down after follower1 has synchronized messages 4 and 5 while follower2 has synchronized only message 4, and follower2 is then elected the new leader. What should be done with the extra message 5 on follower1?

This is where HW coordination is needed. As mentioned earlier, for the ISR list of a partition, the leader's HW is the smallest LEO among all the replicas in the ISR list. Similar to the barrel principle, the water level is determined by the shortest plank.

As shown above, a partition of some topic has three replicas, A, B, and C. A, as the leader, necessarily has the highest LEO; B follows closely; C, whose machine has a lower configuration and a poorer network, synchronizes slowest. Now machine A goes down and B becomes the leader. Without an HW, A would, after recovering, perform the synchronization (makeFollower) operation by directly appending to its log file from the point of failure onward, and once B's LEO caught up with A's LEO, the same offsets would hold inconsistent data. HW avoids this situation: when A resynchronizes, it first truncates its log file to its previous HW position, that is, 3, and then pulls messages from B to synchronize.

Whenever a failed follower recovers, it first truncates its log file to the HW position recorded at its last checkpoint, and then synchronizes messages from the leader. If the leader itself fails, a new leader is re-elected.
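
The truncate-then-fetch rule can be illustrated with a small sketch, with plain Java lists standing in for logs (this is not Kafka's code):

import java.util.ArrayList;
import java.util.List;

// A recovering replica drops everything past its last checkpointed HW, then
// re-fetches from the new leader, so it never diverges from the committed log.
public class RecoveryDemo {
    public static void main(String[] args) {
        List<String> replicaA = new ArrayList<>(List.of("m1", "m2", "m3", "m4", "m5"));
        List<String> leaderB  = List.of("m1", "m2", "m3", "m4'");
        long hw = 3; // last checkpointed high watermark of A

        // Truncate to HW: discard the suffix that was never committed.
        replicaA.subList((int) hw, replicaA.size()).clear();
        // Re-sync the remainder from the new leader.
        replicaA.addAll(leaderB.subList((int) hw, leaderB.size()));

        System.out.println(replicaA); // [m1, m2, m3, m4']
    }
}

This is why a recovering replica never diverges from the new leader beyond the committed watermark.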
