In-depth understanding of Kafka design principles


I have recently been studying Kafka, and the following is a share of its design principles. Kafka is designed to be a unified messaging platform that handles data feeds in real time, and it needs to support large volumes of data with good fault tolerance.

1. Persistence

Kafka uses files to store messages, which directly means that Kafka relies heavily on the performance of the file system itself, and no matter what the OS, there is little room to optimize beyond what the file system already provides; file caching and direct memory mapping are the common tools. Because Kafka only appends to its log files, the cost of disk seeks is small, and to reduce the number of disk writes, the broker temporarily buffers messages and flushes them to disk once the number (or size) of buffered messages reaches a threshold, thus reducing the number of disk I/O calls.
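As an illustration, the flush thresholds are exposed as broker options (normally set in config/server.properties); the sketch below merely collects them into a java.util.Properties object, and the values are assumptions, not recommendations.

```java
import java.util.Properties;

// A minimal sketch of the broker-side flush thresholds; the keys are real
// Kafka broker options, the values are illustrative assumptions.
public class FlushConfigSketch {
    public static void main(String[] args) {
        Properties brokerProps = new Properties();
        brokerProps.put("log.flush.interval.messages", "10000"); // flush after N buffered messages
        brokerProps.put("log.flush.interval.ms", "1000");        // or after this many milliseconds
        brokerProps.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```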
2. Performance

There are many performance points to consider; besides disk I/O, we need to consider network I/O, which directly affects Kafka's throughput. Kafka does not use many tricks here. On the producer side, messages can be buffered and, once their number reaches a threshold, sent to the broker in bulk; the consumer side works the same way, fetching multiple messages in a batch. The batch size can be specified through a configuration file. On the broker side, the sendfile system call can improve network I/O performance: the file's data is handed to the socket directly from kernel memory, without an extra copy through user space. In fact, the CPU cost for producer, consumer, and broker should be small, so enabling message compression is a good strategy; compression costs a small amount of CPU, but for Kafka, network I/O matters more. Any message transmitted over the network can be compressed, and Kafka supports several compression codecs, such as gzip and snappy.
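For example, here is a minimal producer sketch with batching and compression enabled, using the modern Java client's option names (the 0.8-era Scala producer used different keys such as batch.num.messages and compression.codec); the broker address and topic name are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: buffer records into per-partition batches and compress them
// before they cross the network.
public class BatchingProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "16384");        // batch up to 16 KB per partition
        props.put("linger.ms", "10");            // wait up to 10 ms to fill a batch
        props.put("compression.type", "snappy"); // or "gzip"
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my_topic", "key", "value"));
        }
    }
}
```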

3. Producers

Load balancing: the producer keeps a socket connection with the leader of every partition under a topic, and messages are sent directly from the producer to the broker through these sockets, without any "routing layer" in between. In fact, it is the producer client that decides which partition a message is routed to, for example by a "random", "key-hash", or "round-robin" strategy. If a topic has more than one partition, it is necessary for the producer side to implement balanced message distribution.
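As a sketch of the "key-hash" strategy, the modern Java client lets you plug in a custom partitioner; the class below is a hypothetical illustration (the client's stock partitioner already hashes keys in a similar way).

```java
import java.util.Arrays;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Hypothetical "key-hash" partitioner: records with the same key always
// land in the same partition; keyless records get a pseudo-random one.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return (int) (Math.random() * numPartitions); // "random" strategy
        }
        return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions; // "key-hash"
    }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}
```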

The location (host:port) of each partition leader is registered in ZooKeeper; the producer, as a ZooKeeper client, registers a watch to monitor change events for partition leaders.

Asynchronous send: buffer a number of messages on the client, then send them to the broker in bulk. Too many small-data I/O operations drag down overall network latency, so batched, delayed delivery actually improves network efficiency. There are pitfalls, however: if the producer fails, messages that are buffered but not yet sent will be lost.
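A sketch of this loss window with the modern Java client follows; send() only appends the record to the client-side buffer, and the completion callback is the earliest point at which a failure becomes visible. The broker address and topic are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of the asynchronous-send pitfall: a crash before the buffered
// batch leaves the client loses the record silently.
public class AsyncSendSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my_topic", "key", "value"),
                (metadata, exception) -> {
                    if (exception != null) {
                        // The batch never reached the broker: this is the pitfall above.
                        System.err.println("send failed: " + exception);
                    }
                });
            producer.flush(); // push buffered batches out before closing
        }
    }
}
```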

4. Consumers

The consumer sends a "fetch" request to the broker, telling it the offset from which to read messages; the consumer then receives a certain number of messages from that position. The consumer can also reset the offset to re-consume messages.

In JMS implementations, the topic model is push-based, that is, the broker pushes messages to the consumer side. In Kafka, however, the pull model is used: after the consumer establishes a connection with the broker, it actively pulls (fetches) messages. This model has advantages: first, the consumer can fetch and process messages at a pace matched to its own capacity and can control its consumption progress (the offset); in addition, it can control the amount it consumes by fetching in batches.
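A minimal pull-loop sketch with the modern Java client follows (the article's 0.8-era consumer API differs, but the model is the same); broker address, group, topic, and the fetch-size cap are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch of the pull model: the consumer decides when to fetch and how
// much, instead of the broker pushing to it.
public class PullConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "my_group");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("max.partition.fetch.bytes", "1048576"); // cap one batch ("chunk")
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my_topic"));
            while (true) {
                // The consumer pulls at its own pace; poll() blocks at most 1 s.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```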

In other JMS implementations, the position of message consumption is kept by the provider, in order to avoid duplicate delivery, to redeliver messages that were not consumed successfully, and to track message state; this demands a great deal of extra work from the JMS broker. In Kafka, a partition's messages are consumed by only one consumer within a group, the broker does not track message state, and there is no complex acknowledgement mechanism, so the Kafka broker side stays quite lightweight. When messages are received, the consumer can save the offset of the last message locally and register that offset with ZooKeeper intermittently; this shows that the consumer client is lightweight as well.


5. Message delivery mechanism

For JMS implementations, the message delivery guarantee is straightforward: there is only one kind, exactly once. Kafka is slightly different:

1) At most once: the message is delivered at most once, similar to a "non-persistent" message in JMS. It is sent once and, whether that succeeds or fails, it is never resent.

2) At least once: the message is delivered at least once. If it is not received successfully, it may be resent until it is.

3) Exactly once: the message is delivered exactly once.

At most once: the consumer fetches messages, then saves the offset, then processes the messages. If the client saves the offset but an exception occurs during processing, some messages are never handled, and those "unprocessed" messages cannot be fetched again because the offset has already moved past them. This is "at most once".

At least once: the consumer fetches messages, then processes them, then saves the offset. If processing succeeds but a ZooKeeper exception during the save phase makes the offset save fail, the next fetch will return messages that were already processed. This is "at least once": the offset was not committed to ZooKeeper in time, and when ZooKeeper recovers it still holds the previous offset.
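The difference is purely the ordering of "commit offset" and "process message". Below is a sketch with the modern Java client (which commits offsets to Kafka rather than ZooKeeper, but the ordering argument is identical); enable.auto.commit=false and a hypothetical handle() method are assumed.

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch: the same poll loop yields either guarantee depending on whether
// the offset commit happens before or after processing.
public class DeliverySemanticsSketch {
    static void atMostOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        consumer.commitSync();                 // commit first...
        for (ConsumerRecord<String, String> r : records) {
            handle(r);                         // ...a crash here loses these records
        }
    }

    static void atLeastOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> r : records) {
            handle(r);                         // process first...
        }
        consumer.commitSync();                 // ...a crash here re-delivers them
    }

    static void handle(ConsumerRecord<String, String> r) {
        /* hypothetical application logic */
    }
}
```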

Exactly once: Kafka does not strictly implement this (it would require something like a two-phase commit or transactions); we think this strategy is not necessary in Kafka.

Usually "at least once" is our first choice: compared with "at most once", receiving data twice is better than losing it.

6. Replication

Kafka replicates each partition's data to multiple servers. Any partition has one leader and zero or more followers; the number of replicas can be set through the broker configuration file. The leader handles all read and write requests, and the followers must keep synchronized with the leader. A follower behaves like a consumer: it consumes messages and saves them in its local log. The leader is responsible for tracking the state of all followers; if a follower "lags" too far behind or fails, the leader removes it from the replica sync list. A message is considered "committed" only when all followers have saved it successfully, and only then may consumers consume it. Even if only one replica instance survives, messages can still be sent and received normally, as long as the ZooKeeper cluster survives. (This is unlike other distributed storage systems, such as HBase, which require a "majority" to survive.)
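For illustration, with the modern AdminClient the replica count can be set per topic at creation time (the 0.8-era setup described here sets it via the broker configuration file instead); broker address, topic name, and counts are assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

// Sketch: create a topic whose partitions each have one leader and two
// followers (replication factor 3).
public class ReplicatedTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("my_topic", 2, (short) 3); // 2 partitions, 3 replicas
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```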

When the leader fails, a new leader must be elected from among the followers. A follower may lag behind the leader, so an "up-to-date" follower should be chosen. There is another issue when choosing: the number of partition leaders already hosted on the candidate's server. If too many partition leaders sit on one server, that server bears more I/O pressure, so "load balancing" also needs to be considered in the election of a new leader.

7. Log

If a topic is named "my_topic" and it has 2 partitions, its logs are saved in two directories, my_topic_0 and my_topic_1. A log file holds a sequence of "log entries"; the format of each entry is a 4-byte number N giving the length of the message, followed by N bytes of message content. Each message is uniquely marked by an offset, an 8-byte value representing the starting position of the message within the partition. At the physical storage level, each partition consists of multiple log files (called segments). A segment file is named after its minimum offset, for example "00000000000.kafka", where the "minimum offset" is the offset of the first message in that segment.
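As an illustration of that simplified entry layout, the sketch below walks a segment file and decodes "4-byte length + payload" records; real Kafka entries carry additional fields such as a CRC and attributes, and the file path here is an assumption.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

// Illustrative reader for the simplified on-disk entry format described
// above ("4-byte length N" + "N bytes of payload").
public class LogEntryReaderSketch {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new FileInputStream("my_topic_0/00000000000.kafka"))) {
            while (true) {
                int length;
                try {
                    length = in.readInt();      // the 4-byte message length
                } catch (EOFException eof) {
                    break;                      // clean end of segment
                }
                byte[] payload = new byte[length];
                in.readFully(payload);          // the N-byte message content
                System.out.println("entry of " + length + " bytes");
            }
        }
    }
}
```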


The list of segments held by each partition is stored in ZooKeeper.

When a segment file reaches a certain size threshold (configurable, 1 GB by default), a new file is created. When the number of messages in the buffer reaches a threshold, a flush of the log data to the log file is triggered; a flush is likewise triggered when the "time since the most recent flush" reaches a threshold. If the broker fails, messages that have not yet been flushed to file are very likely lost. Moreover, an unexpected server crash may still leave the log file format broken (at the end of the file), so on startup the server must check whether the structure of the last segment file is valid and make any necessary repairs.

When fetching messages, the client specifies an offset and a maximum chunk size. The offset marks the starting position of the messages, and the chunk size bounds the total length of the messages returned (indirectly, the number of messages). Based on the offset, the segment file containing the message can be located; then, using the segment's minimum offset, the message's relative position within the file is computed and the data is read out directly.
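A sketch of that lookup: choose the segment whose base ("minimum") offset is the largest one not exceeding the target, then compute the relative offset inside it. The base offsets below are made-up values, and resolving the relative offset to an exact byte position would additionally need an index.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of locating the segment that holds a given offset.
public class SegmentLookupSketch {
    public static void main(String[] args) {
        // base offset -> segment file name (illustrative values)
        TreeMap<Long, String> segments = new TreeMap<>();
        segments.put(0L, "00000000000.kafka");
        segments.put(368769L, "00000368769.kafka");
        segments.put(737337L, "00000737337.kafka");

        long target = 400000L;
        // floorEntry does the "largest base offset <= target" search.
        Map.Entry<Long, String> entry = segments.floorEntry(target);
        long relative = target - entry.getKey();
        System.out.println("offset " + target + " lives in " + entry.getValue()
                + " at relative offset " + relative);
    }
}
```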

The deletion policy for log files is simple: a background thread periodically scans the list of log files and deletes files that have been kept longer than a threshold (based on the file's creation time). To avoid breaking a read operation (consumer consumption) still in progress on a file being deleted, a copy-on-write approach is taken.

8. Distribution

Kafka uses ZooKeeper to store metadata and relies on the ZooKeeper watch mechanism to discover metadata changes and act on them (for example, a consumer failure triggering load rebalancing).

1) Broker node registry: when a Kafka broker starts, it first registers its own node information (an ephemeral znode) in ZooKeeper; when the broker disconnects from ZooKeeper, this znode is deleted.

Format: /brokers/ids/[0...N] --> host:port, where [0...N] is the broker ID. Each broker's configuration file must specify a numeric ID (globally unique). The znode's value is the broker's host:port information.

2) Broker topic registry: when a broker starts, it registers its own topic and partition information in ZooKeeper; this is still an ephemeral znode.

Format: /brokers/topics/[topic]/[0...N], where [0...N] is the partition index.

3) Consumer and consumer group: when each consumer client is created, it registers its own information in ZooKeeper; this is mainly for "load balancing" purposes.

Multiple consumers in one group can consume all partitions of a topic in an interleaved way. In short, it is guaranteed that all partitions of the topic are consumed by this group and, for performance reasons, the partitions are spread relatively evenly across the consumers.

4) Consumer ID registry: each consumer has a unique ID (host:uuid, which can be specified in the configuration file or generated by the system); the ID is used to mark the consumer's information.

Format: /consumers/[group_id]/ids/[consumer_id]

This is still an ephemeral znode; its value is {"topic_name": #streams ...}, representing the topic + partitions currently consumed by this consumer.

5) Consumer offset tracking: used to track the largest offset currently consumed by each consumer in each partition.

Format: /consumers/[group_id]/offsets/[topic]/[broker_id-partition_id] --> offset_value

This znode is a persistent node. Note that the offset is keyed by group_id, which shows that when one consumer in a group fails, another consumer can pick up and continue consuming.

6) Partition owner registry: used to mark which consumer is consuming each partition. An ephemeral znode.

Format: /consumers/[group_id]/owners/[topic]/[broker_id-partition_id] --> consumer_node_id

When a consumer starts, the following actions are triggered:

a) First perform the "Consumer ID registry";

b) Then register a watch under the "Consumer ID registry" node to listen for other consumers in the current group "leaving" and "joining"; whenever the list of nodes under this znode path changes, load balancing is triggered for the consumers in the group (for example, if one consumer fails, the other consumers take over its partitions).

c) Register a watch under the "Broker ID registry" node to monitor broker liveness; if the broker list changes, rebalancing is triggered for all consumers in all groups. (A small sketch of such a watch follows.)
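Below is a minimal sketch of observing the broker registry with the plain ZooKeeper client API; the connection string is an assumption, and the paths follow the layout described in this section.

```java
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch: list the live broker IDs and register a watch that fires when
// the broker list changes (the event that drives consumer rebalancing).
public class BrokerWatchSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // A change under /brokers/ids would trigger rebalancing here.
                System.out.println("broker list changed: " + event);
            }
        };
        List<String> brokerIds = zk.getChildren("/brokers/ids", watcher);
        System.out.println("live brokers: " + brokerIds);
        zk.close();
    }
}
```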


1) The producer side uses ZooKeeper to "discover" the broker list, establishes a socket connection with each partition leader under the topic, and sends messages.

2) The broker side uses ZooKeeper to register broker information and to monitor partition-leader liveness.

3) The consumer side uses ZooKeeper to register consumer information, including the partition list it consumes, and also to discover the broker list; it establishes socket connections with the partition leaders and fetches messages.

