Kafka/MetaQ Design Ideas: Study Notes (Reprinted)

Source: Internet
Author: User
Tags: crc32, failover, publish/subscribe, repetition, sendfile, zookeeper

Reprinted from: http://my.oschina.net/geecoodeer/blog/194829

This article does not deliberately distinguish between Kafka and MetaQ; it simply lists the design ideas I consider good, as a reference for future designs.
I have not yet studied the code in detail, so if anything here is incorrect, please point it out.

Concepts and terminology
    • Message: the data transferred between producers, the server, and consumers.
    • Message broker: in plain terms, the MQ server or servers.
    • Message producer: responsible for generating messages and sending them to the Meta server.
    • Message consumer: responsible for consuming messages.
    • Message topic: defined by the user and configured on the broker. A producer sends messages to a topic; a consumer consumes messages from a topic.
    • Topic partition: a topic can be divided into multiple partitions. Each partition is an ordered, immutable, sequentially appended commit log.
    • Consumer group: a set of consumers that together consume the messages under one topic, each consumer consuming part of them. These consumers share the same group name and are often referred to as a consumer cluster.
    • Offset: messages in a partition carry an incrementing ID called the offset, which uniquely identifies a message within the partition.
Basic working mechanism: architecture schematic

As the schematic shows, there are four clusters; among them, the broker cluster has a master-slave structure.

Multiple brokers form a cluster that serves a set of topics. The producer cluster sends messages for a topic to a broker in the cluster according to certain routing rules, and the consumer cluster pulls messages from a broker according to certain routing rules.

How producers, brokers, and consumers process messages

Each broker can be configured with the number of partitions a topic has, but from the producer's point of view, a topic is the combined list of all its partitions across all brokers.

When a producer is created, it obtains the broker and partition list for the topics it publishes from ZooKeeper. After getting the partition list through ZK, the producer organizes it into an ordered list sorted by broker ID and partition, and selects partitions from beginning to end when sending messages.

If you want to implement your own load balancing strategy, you can implement the corresponding load balancing policy interface, as sketched below.
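For illustration, here is a minimal sketch of the beginning-to-end (round-robin) selection described above. The Partition type and the selector shape are hypothetical stand-ins, not MetaQ's actual API:

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class RoundRobinPartitionSelector {
    /** Minimal partition descriptor; real client metadata carries more fields. */
    public static class Partition {
        final int brokerId;
        final int partitionId;
        Partition(int brokerId, int partitionId) {
            this.brokerId = brokerId;
            this.partitionId = partitionId;
        }
    }

    private final AtomicLong counter = new AtomicLong();

    /** Pick the next partition from the list ordered by (brokerId, partitionId). */
    public Partition select(List<Partition> orderedPartitions) {
        long n = counter.getAndIncrement();
        return orderedPartitions.get((int) (n % orderedPartitions.size()));
    }
}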

After sending a message, the producer receives a processing result, which is one of success, failure, or timeout.

After receiving a message, the broker verifies the checksum, writes the message to disk, and returns the processing result to the producer.

When consuming, the consumer first increments the offset, finds the corresponding message by that offset, and then consumes it. Only after a message has been consumed successfully does it continue to the next one. If consuming a message fails (for example, with an exception), it retries; if the message still cannot be consumed after the maximum number of retries, the message is stored on the consumer's local disk and a background thread keeps retrying it, while the main thread moves on to consume subsequent messages.
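As a rough sketch of that retry flow (the interfaces here are hypothetical stand-ins, not MetaQ's actual client API):

// Hypothetical interfaces standing in for the real client internals.
interface Fetcher { Message fetch(long offset); }
interface Handler { void handle(Message m) throws Exception; }
interface LocalStore { void save(Message m); } // disk-backed; a background thread retries saved messages

class ConsumeLoop {
    void run(Fetcher fetcher, Handler handler, LocalStore store, int maxRetries) {
        long offset = 0; // in reality, restored from the stored offset
        while (true) {
            Message msg = fetcher.fetch(offset);
            boolean ok = false;
            for (int attempt = 0; attempt < maxRetries && !ok; attempt++) {
                try { handler.handle(msg); ok = true; }
                catch (Exception retryable) { /* retry in place */ }
            }
            if (!ok) store.save(msg);  // hand off to the background retry thread
            offset++;                  // the main thread moves on to the next message
        }
    }
}
class Message {}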

DFX: message ordering

Ordering means that if messages are sent in the order A, B, C, consumers should also consume them in the order A, B, C.

Within a single thread, the producer sends messages to the same partition on the same server, so they reach the server and are stored in the order they were sent, and are consumed by the consumer in that same order.

Reliability: how the broker stores messages

Writing to disk does not mean the data has reached the disk device; after all, the OS sits in between with its write buffer. There are typically two ways to ensure data reaches the disk: force it to the device based on processing frequency (number of messages), or based on a time interval.
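A minimal sketch of such a group-commit policy in Java, assuming a FileChannel-backed log; the thresholds are illustrative stand-ins for what a real broker exposes as configuration:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Force data to the disk device every flushEveryN messages or every
 *  intervalMs milliseconds, whichever comes first. */
class FlushPolicy {
    private final FileChannel channel;
    private final int flushEveryN;
    private final long intervalMs;
    private int unflushed = 0;
    private long lastFlush = System.currentTimeMillis();

    FlushPolicy(FileChannel channel, int flushEveryN, long intervalMs) {
        this.channel = channel;
        this.flushEveryN = flushEveryN;
        this.intervalMs = intervalMs;
    }

    void append(ByteBuffer msg) throws IOException {
        channel.write(msg);            // lands in the OS write buffer first
        unflushed++;
        long now = System.currentTimeMillis();
        if (unflushed >= flushEveryN || now - lastFlush >= intervalMs) {
            channel.force(true);       // fsync: data actually reaches the device
            unflushed = 0;
            lastFlush = now;
        }
    }
}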

Broker disaster recovery

Similar to MySQL's synchronous and asynchronous replication, the data of one master server is fully replicated to a slave server, and the slave also serves consumption. Kafka describes it as: "each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster." Simply put, each server acts as the leader for its own partitions and as a follower for the partitions of other servers, thus achieving load balancing.

In theory, synchronous replication gives higher reliability, while asynchronous replication can lose a very small amount of message data because of replication lag. In turn, synchronous replication costs performance, because a write only succeeds once it has been written to two or more broker machines. In practice, the asynchronous replication architecture is recommended: it is relatively simple, easy to maintain and recover, and has no impact on performance, whereas synchronous replication has a higher operational cost, a complex and error-prone mechanism, and more troublesome fault recovery. Asynchronous replication combined with RAID disk arrays is sufficient for very demanding data reliability requirements.

The first replica takes time to fully synchronize with the master; you can observe the replication progress in the data file directory.

With asynchronous replication, slaves can participate in consumption: consumers can fetch and consume messages from a slave, and a consumer will randomly choose among the master and slaves as its broker.

Performance

The sendfile system call reduces byte-copying overhead and system-call overhead.
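In Java, sendfile is exposed through FileChannel.transferTo, which moves bytes from the page cache to the socket without copying them into user space. A minimal sketch:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

class ZeroCopySend {
    /** Send count bytes of the log file starting at position to the socket. */
    static long sendFileRange(FileChannel log, SocketChannel socket,
                              long position, long count) throws IOException {
        long sent = 0;
        while (sent < count) {  // transferTo may transfer fewer bytes than requested
            long n = log.transferTo(position + sent, count - sent, socket);
            if (n <= 0) break;
            sent += n;
        }
        return sent;
    }
}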

The message-set concept enables batch processing, which increases the amount of content transferred per network round trip, reduces round-trip overhead, and yields sequential disk operations and contiguous blocks of memory. Batches can also be compressed, with a higher compression ratio than compressing messages individually.

Exception handling: duplicate messages

Message duplication has two aspects: producers sending duplicate messages, and consumers consuming messages more than once.

For producers, the following can happen: a producer sends a message and waits for the server's response; a network failure occurs; the server actually wrote the message successfully, but the response is lost to the failure. The producer then believes the send failed and sends the same message again; if that send succeeds, the server ends up storing two identical messages. MQ cannot avoid this kind of failure-induced duplication, because MQ does not judge whether message data is consistent: it does not understand the semantics of the data, it only transports it as a payload.

Consumers have the same problem: a consumer successfully consumes a message, but then loses power before it can store the advanced offset. On the next startup, or when another consumer in the same group takes ownership of the partition, the message is consumed again. MQ cannot completely avoid this case either.

Load balancing and failover of producers

When a broker cannot serve because of a restart or failure, the producer perceives the change through ZooKeeper and removes the failed partitions from its list, thereby failing over. Because there is a delay between the failure and its perception, some message sends may fail in that instant.

Operations and management: parameter maintenance
    • A web management platform, accessed through a browser
    • A RESTful API (see the documentation)
    • JMX ports, to view information or modify parameters through APIs or tools such as JConsole
Disk space management

By default, Meta keeps appending messages and periodically deletes or archives "expired" data. You can choose when deletion or backup starts and how long data is retained before being deleted or backed up.

System design and selection: why divide a topic into multiple partitions?

Dividing a topic into multiple partitions divides it into multiple files, which prevents a single topic's file from growing too large. Each partition can only be consumed by one consumer within a consumer group. In addition, some of a topic's partitions can be replicated to followers to achieve load balancing and failover.

Why consumer groups are needed

First, there are two traditional models: queue and topic. A queue guarantees that each message is consumed by only one consumer; a topic broadcasts each message to all consumers for consumption.

By design, a message can be consumed by different consumer groups, and within each consumer group it is consumed only once. Thus, with a single consumer group you get queue semantics, and with multiple consumer groups you get topic semantics.

Why choose a page-cache-centric design

Excerpted from the translation of the Kafka architecture design (distributed publish-subscribe messaging system):
Linear writes run at roughly 300 MB/second, while random writes manage only about 50 KB/second, a difference of nearly 10,000 times. Linear reads and writes are among the most predictable of all usage patterns, and modern operating systems provide read-ahead (prefetching multiple blocks into memory) and write-behind (merging a set of small writes into one larger write) techniques.

Modern operating systems increasingly use main memory as disk cache. All modern operating systems will happily turn all free memory into disk cache, at some performance cost when that memory needs to be reclaimed. All disk reads and writes go through this unified cache. It is not easy to opt out of this feature unless you use direct I/O.

Therefore, even if a process holds a copy of the data in an in-process cache, the data is likely duplicated in the OS page cache, so each piece of data is effectively stored twice. Also note that the memory overhead of Java objects is very large, often double (or worse) the size of the data actually stored in the object, and Java garbage collection becomes increasingly sluggish and expensive as heap data grows.

Because of these factors, using the file system and relying on the page cache is better than maintaining an in-memory cache or other structures: by automatically having access to all free memory, we at least double the available cache size, and by storing compact byte structures instead of individual objects we can likely double it again. Doing so, we can get a cache of up to 28 to 30 GB on a machine with 32 GB of memory, without GC penalties. Furthermore, this cache stays warm across a service restart, unlike an in-process cache, which must be rebuilt in memory after the process restarts (rebuilding a 10 GB cache may take 10 minutes) or else start completely cold (with likely terrible initial performance). This also greatly simplifies the code, because all the logic for keeping the cache and the file system consistent now lives in the OS, which is more efficient and more accurate than a per-process, one-off cache. And if your disk usage favors linear reads, read-ahead effectively pre-populates this cache with useful data on every read (this is what lets sequentially increasing offset reads achieve high IO performance).

Push vs. pull

Should the consumer pull messages from the broker, or should the broker push messages to the consumer? Each has pros and cons.

A push-based system has difficulty controlling the rate at which data is delivered to different consumers and may overwhelm a consumer. Pull does better here: consumers control their own processing rate.

In addition, pull-based consumers can fetch data in batches. A push-based broker has a harder time: should it send one message at a time or in bulk? If in bulk, how many at a time?

Pull has a downside: if the broker has no data, pull-based consumers may busy-wait. This can be solved with a "long poll" mechanism (comparable to Java's Future.get), as sketched below.
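A broker-side sketch of the long-poll idea, assuming a hypothetical in-memory log: a fetch that finds no data parks until new messages arrive or a timeout elapses, instead of returning empty immediately:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class LongPollLog {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition dataArrived = lock.newCondition();
    private long endOffset = 0;

    /** Called by the write path after appending a message. */
    void onAppend() {
        lock.lock();
        try {
            endOffset++;
            dataArrived.signalAll();   // wake any parked long-poll requests
        } finally {
            lock.unlock();
        }
    }

    /** Returns the end offset, waiting up to timeoutMs for it to pass fromOffset. */
    long awaitData(long fromOffset, long timeoutMs) throws InterruptedException {
        lock.lock();
        try {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while (endOffset <= fromOffset) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) break;   // timed out: return what we have
                dataArrived.awaitNanos(remaining);
            }
            return endOffset;
        } finally {
            lock.unlock();
        }
    }
}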

Consumer location

Most messaging systems keep metadata on the broker recording which messages have been consumed. That is, when a message is handed to a consumer, the broker either records it immediately or waits for the consumer's acknowledgement. But there are problems here: if a message is sent over the network to a consumer that crashes before processing it, while the broker has already recorded it as consumed, the message is lost. To avoid this, many systems add an acknowledgement feature marking a message as successfully consumed. But the consumer's acknowledgement may never reach the broker, which in turn causes the message to be consumed twice. Second, this approach adds network overhead, and the server must maintain per-message processing state.

In a Kafka-like system, a topic consists of multiple ordered partitions, and each partition is consumed by at most one consumer at any time. This means the consumer's position within each partition is just an integer: the offset of the next message to consume. Tracking which messages have been consumed becomes much easier, for example via periodic checkpoints.

Message delivery semantics

A Kafka-like system provides three kinds of guarantees when delivering messages:

    • At most once: messages may be lost but are never redelivered
    • At least once: messages are never lost but may be redelivered
    • Exactly once: each message is delivered once and only once

The problem splits into two parts: the durability guarantees for publishing a message, and the guarantees when consuming a message.

There is no perfect solution. When a producer publishes a message, it can set a primary key on the message and retry after a failed send; the key then allows the duplicate to be detected.

When a consumer consumes messages, there are three situations:

    1. Read the message, save the offset, then process the message. If it crashes during processing, this corresponds to the "at most once" scenario.
    2. Read the message, process it, then save the offset. If it crashes while saving the offset, this corresponds to the "at least once" scenario.
    3. The classic approach is a two-phase commit (2PC) across the two steps of saving the offset and processing the message. In Kafka, a simpler way is to store the offset together with the processed result, as sketched below.
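A toy sketch of option 3's simpler variant: keeping the offset and the processed result in one atomically published state, so a crash can never leave them inconsistent. The state type here is hypothetical; in practice both values would go into the same transactional store:

class OffsetWithResultStore {
    /** Offset and result live in one immutable object; publishing the new
     *  object is a single atomic step, so the two can never disagree. */
    static final class State {
        final long nextOffset;
        final String result;
        State(long nextOffset, String result) {
            this.nextOffset = nextOffset;
            this.result = result;
        }
    }

    private volatile State state = new State(0, "");

    void process(String message) {
        State s = state;
        String newResult = s.result + message;           // "process" the message
        state = new State(s.nextOffset + 1, newResult);  // atomic publish of both
    }
}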
Replication

Kafka can replicate each topic's partitions across several servers (configurable). Many messaging systems require cumbersome manual configuration to provide replication-related features, for fear that replication hurts throughput. Kafka provides replication by default: a user can set the replication factor to 1, which is equivalent to no replication.

Each partition has one leader and zero or more followers.

The node in "Alive" consists of the following two conditions:

    1. Must and ZK exist session 2. If the node is slave, it must ensure that the write replication distance is leader.

The leader keeps a list of all nodes that are in sync. If a follower dies or falls too far behind, the leader removes it from the list. "Too far behind" can be defined by the number of lagging messages and a lag time parameter.

A message can be marked "committed" only when all in-sync replicas have completed replicating it, and only "committed" messages can be consumed. On the other hand, producers can trade off latency against durability by setting whether to wait for a message to be committed, or for how many acks to wait.

With the pull model, is message delivery still timely?

Message latency is affected by many factors; pull does not necessarily mean worse real-time behavior. The main factors are:

    • The batch force-to-disk threshold configured on the broker: the higher the threshold, the worse the latency.
    • The amount of data the consumer fetches each time: the larger it is, the worse the latency but the higher the throughput.
    • The number of partitions per topic: more partitions mean more disk pressure, degrading message delivery latency.
    • The consumer's fetch retry interval: the longer it is, the more severe the delay.
    • The number of threads the consumer uses to fetch data.
The storage structure of the message

In Kafka, the message format is as follows

/**
 * A message. The format of an N byte message is the following:
 *
 * If magic byte is 0:
 * 1. 1 byte "magic" identifier to allow format changes
 * 2. 4 byte CRC32 of the payload
 * 3. N - 5 byte payload
 *
 * If magic byte is 1:
 * 1. 1 byte "magic" identifier to allow format changes
 * 2. 1 byte "attributes" identifier to allow annotations on the message
 *    independent of the version (e.g. compression enabled, type of codec used)
 * 3. 4 byte CRC32 of the payload
 * 4. N - 6 byte payload
 */

The message format on disk is as follows:

length  : 4 bytes (value: 1 + 4 + n)
"magic" : 1 byte
crc     : 4 bytes
payload : n bytes

The message format for Metaq is as follows

length (4 bytes), covering the message attributes and payload data
checksum (4 bytes)
message id (8 bytes)
message flag (4 bytes)
attribute length (4 bytes) + attribute, optional
payload

The checksum is calculated with the CRC32 algorithm over the message attribute length + message attributes + data; if there are no attributes, they are excluded. The consumer verifies the checksum after receiving the message, as sketched below.
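A minimal sketch of that verification using Java's built-in CRC32; assembling the checksummed byte range (attribute length + attribute, if present, + data) is assumed to have happened already:

import java.util.zip.CRC32;

class ChecksumCheck {
    /** Verify that the CRC32 of the checksummed bytes matches the stored value. */
    static boolean verify(byte[] checksummedBytes, long expectedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(checksummedBytes);
        return crc.getValue() == expectedChecksum;
    }
}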

The following is excerpted from the MetaQ documentation:

Under the same topic there are different partitions, and each partition is divided into multiple files. Only one current file is written at a time; the other files are read-only. When the current file is full (reaches the configured size), it rolls over: a new current file is created for writing and the old one becomes read-only. Files are named by their starting offset. For example, the 0-0 partition under the topic meta-test might have the following files:

    • 00000000000000000000000000000000.meta
    • 00000000000000000000000000001024.meta
    • 00000000000000000000000000002048.meta
    • ......

Here 00000000000000000000000000000000.meta is the first file, with starting offset 0. The second file, 00000000000000000000000000001024.meta, has starting offset 1024, which indicates that the previous file's size is 1024 - 0 = 1024. Similarly, the third file 00000000000000000000000000002048.meta has starting offset 2048, indicating that the size of 00000000000000000000000000001024.meta is 2048 - 1024 = 1024.

With files named and sorted by starting offset, it is straightforward for a consumer to fetch data from a given starting offset: binary-search the file list on the requested offset to locate the specific file, then subtract the file's starting offset from the absolute offset to get the relative offset at which to begin transmitting data. For example, suppose a consumer wants to fetch 1 MB of data starting from offset 1536. Binary search on 1536 locates the file 00000000000000000000000000001024.meta (1536 lies between 1024 and 2048), and 1536 - 1024 = 512, so transmission actually starts at offset 512 within that file. A sketch follows.
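A small sketch of that lookup, reproducing the worked example above (the file start offsets and the 1536 target come from the text):

import java.util.List;

class SegmentLookup {
    /** Binary-search the ascending start offsets for the greatest start <= target. */
    static int findSegment(List<Long> startOffsets, long target) {
        int lo = 0, hi = startOffsets.size() - 1, found = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (startOffsets.get(mid) <= target) { found = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return found;
    }

    public static void main(String[] args) {
        List<Long> starts = List.of(0L, 1024L, 2048L);
        int idx = findSegment(starts, 1536);      // -> the segment starting at 1024
        long relative = 1536 - starts.get(idx);   // -> 512, as in the example
        System.out.println(idx + " " + relative);
    }
}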

ZooKeeper usage: Broker Node Registry

/brokers/ids/[0...N] –> host:port (ephemeral node)
[0...N] is the broker ID; each broker ID must be unique. Registration happens when the broker starts.
In other words, each broker corresponds to one host:port; a registration sketch follows.
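A sketch of such a registration with the stock ZooKeeper client; the EPHEMERAL mode is what makes the node vanish when the broker's session dies (connection setup and error handling omitted):

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class BrokerRegistry {
    /** Register this broker's host:port under an ephemeral znode. */
    static void register(ZooKeeper zk, int brokerId, String hostPort) throws Exception {
        zk.create("/brokers/ids/" + brokerId,
                  hostPort.getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
    }
}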

Broker Topic Registry

/brokers/topics/[topic]/[0...N] –> nPartitions (ephemeral node)
This records the number of partitions each broker ID has for the topic.

Consumer Id Registry

A consumer group contains multiple consumers, each with a distinct name, and each consumer carries a group ID attribute.

/consumers/[group_id]/ids/[consumer_id] –> {"topic1": #streams, ..., "topicN": #streams} (ephemeral node)

This records, for each consumer in a consumer group, the list of topics it consumes.

Consumer Offset Tracking

/consumers/[group_id]/offsets/[topic]/[broker_id-partition_id] –> offset_counter_value (persistent node)

The value is the offset counter of each consumer group for each broker_id-partition_id of a topic.

Partition Owner Registry

/consumers/[group_id]/owners/[topic]/[broker_id-partition_id] –> consumer_node_id (ephemeral node)

This records which consumer_node_id in a consumer group owns (consumes) each broker_id-partition_id of a topic.

Broker Node Registration

When a new broker joins, it registers itself under the broker node, with its hostname and port as the value. It also registers the list of topics it holds and their partitions. New topics are registered in ZK automatically when created.

Consumer Registration algorithm

When a consumer starts, it:

    1. Registers itself in its consumer group.
    2. Registers a watch for change events under the consumer IDs (consumers joining or leaving); each such event triggers a recalculation of the consumer load within the group.
    3. Registers a watch for change events under the broker IDs (brokers joining or leaving); each such event triggers a recalculation of the consumer load for all consumer groups.
    4. If the consumer uses the topic filter mechanism, it registers a watch for change events under the broker topics (new topics joining); each such event triggers a recalculation of the load of the consumers of the affected topics.
    5. Upon joining, it recalculates the consumer load for its group.
Consumer Rebalancing algorithm

A partition can only be consumed by one consumer, which avoids unnecessary synchronization mechanisms. The specific algorithm is as follows:

    1. For each topic T that Ci subscribes to:
    2. let PT be all partitions of topic T
    3. let CG be all consumers in the same group as Ci that consume topic T
    4. sort PT (so that partitions on the same broker are clustered together)
    5. sort CG
    6. let i be the index position of Ci in CG and let N = size(PT) / size(CG)
    7. assign partitions from i*N to (i+1)*N - 1 to consumer Ci
    8. remove the entries currently owned by Ci from the partition owner registry
    9. add the newly assigned partitions to the partition owner registry (we may need to retry until the original partition owner releases its ownership)

The pseudocode (translated from the Chinese) is as follows:

set topicList = consumer.subscriptions              // all topics this consumer subscribes to
for each topic in topicList:
    partitionList = topic.partitions                // all partitions of the topic
    consumerList = intersect(topic.consumers, consumer.group.consumers)
                                                    // consumers of this topic in the same group as this consumer
    partitionList.sort()                            // e.g. broker0-p0, broker0-p1, broker1-p0, broker1-p1
    consumerList.sort()
    consumerIndex = consumerList.indexOf(consumer)  // this consumer's index within the group
    N = partitionList.size() / consumerList.size()  // number of partitions divided by number of consumers
    // (The original author: "the last few sentences I really did not understand;
    //  probably need to read the source. TODO")
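For what it's worth, here is a hedged Java rendering of the range-assignment rule from the English steps above. It uses ceil for N so leftover partitions are covered (one common variant); all names are illustrative:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RangeAssignor {
    /** Returns the partitions assigned to consumer `me`. */
    static List<String> assign(List<String> partitions, List<String> consumers, String me) {
        List<String> pt = new ArrayList<>(partitions);
        List<String> cg = new ArrayList<>(consumers);
        Collections.sort(pt);   // e.g. broker0-p0, broker0-p1, broker1-p0, broker1-p1
        Collections.sort(cg);
        int i = cg.indexOf(me);
        int n = (int) Math.ceil((double) pt.size() / cg.size());
        int from = Math.min(i * n, pt.size());
        int to = Math.min(from + n, pt.size());
        return new ArrayList<>(pt.subList(from, to));
    }
}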
A brief introduction to RocketMQ

Because RocketMQ's documentation is not yet very systematic, and my time is limited, I have only skimmed it roughly. A few points are worth mentioning.

Message filtering

It supports broker-side message filtering: the broker filters according to the consumer's requirements. The advantage is that useless messages are not transmitted over the network to the consumer; the disadvantage is that it adds load on the broker and is relatively complex to implement.

It also supports consumer-side message filtering. This can be fully customized by the application, but the disadvantage is that many useless messages are transferred to the consumer side.

Zero-copy choices

When consumers consume messages, zero copy is used. Zero copy comes in the following two flavors:

    1. mmap + write. Advantage: efficient even for frequent calls and small-block transfers. Disadvantages: cannot exploit DMA as well, costs more CPU than sendfile, memory-safety control is complex, and JVM crash problems must be avoided.
    2. sendfile. Advantages: can use DMA, consumes little CPU, efficient for large-file transfers, and introduces no new memory-safety problems. Disadvantages: less efficient than mmap for small blocks, and transfers are limited to BIO; NIO cannot be used.
      RocketMQ chose the first way, mmap + write, because it needs small-block transfers, where mmap performs better than sendfile. A sketch of the mmap approach follows the list.
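A minimal sketch of the mmap side of that trade-off in Java, using a MappedByteBuffer; the file name and sizes are illustrative:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class MmapExample {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("commitlog.data", "rw")) {
            // Mapping the file lets small reads/writes go through the page
            // cache without a system call per access.
            MappedByteBuffer map = raf.getChannel()
                    .map(FileChannel.MapMode.READ_WRITE, 0, 1024 * 1024);
            map.put("hello".getBytes());   // written via the page cache
            map.force();                   // flush the mapped region to disk
        }
    }
}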
Service discovery

Name Server is a lightweight naming service designed for RocketMQ, with fewer than 1,000 lines of code; it is simple, horizontally scalable as a cluster, and stateless. The planned master/standby auto-switchover feature will depend strongly on Name Server.

Postscript

Without reading the source, there is always a feeling that something is missing.

My translation of the English passages is still rather stiff.

The core is the distinctive design of the model and the very ingenious use of ZooKeeper, along with the consideration given to many details. It really is a very good MQ.

Next up: finish reading the ZK source.

References
      1. Kafka 0.8 documentation
      2. Metamorphosis wiki
      3. RocketMQ wiki
      4. Translation of the Kafka architecture design (distributed publish-subscribe messaging system)
