Kafka relies on the file system, and through it the operating system's page cache, to store and cache messages.
Linear reads and writes are the most predictable of all usage patterns, so the operating system optimizes them heavily with read-ahead and write-behind. Read-ahead loads larger disk blocks into memory before they are requested; write-behind combines small logical writes into larger physical writes.
Using the file system and relying on the page cache is better than maintaining a cache or any other structure in process memory.
Having automatic access to all free memory at least doubles the available cache, and storing a compact byte structure instead of individual objects is likely to double it again.
This also greatly simplifies the code, because all the logic for keeping the cache consistent with the file system lives in the OS, which does the job more efficiently and more correctly than a one-off in-process cache.
If disk usage favors linear reads, read-ahead effectively pre-populates this cache with useful data on every disk read.
Data is written to the OS kernel's page cache first, and the OS later flushes it to disk. In addition, a configurable flush policy lets the user control how often data is flushed to the physical disk (after every n messages or every m seconds), which puts an upper bound on the amount of data "at risk" if the hardware crashes.
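The flush policy described above can be sketched roughly as follows. This is a minimal illustration, not Kafka's implementation: the class FlushingLog and its parameters are hypothetical names, and the time limit is only checked when a message is appended, whereas a real broker would also check it from a background task. In Kafka itself, the broker settings log.flush.interval.messages and log.flush.interval.ms play the role of n and m.

```python
import os
import time


class FlushingLog:
    """Append-only log that fsyncs after every n messages or every m seconds."""

    def __init__(self, path, flush_messages, flush_seconds):
        # buffering=0 hands each write straight to the OS, so the bytes sit in
        # the page cache rather than in a user-space buffer.
        self._file = open(path, "ab", buffering=0)
        self._flush_messages = flush_messages
        self._flush_seconds = flush_seconds
        self._unflushed = 0
        self._last_flush = time.monotonic()
        self.fsync_count = 0  # kept only so the behaviour can be observed

    def append(self, payload: bytes) -> None:
        self._file.write(payload)  # in the page cache, not yet durable
        self._unflushed += 1
        elapsed = time.monotonic() - self._last_flush
        if self._unflushed >= self._flush_messages or elapsed >= self._flush_seconds:
            self.flush()

    def flush(self) -> None:
        os.fsync(self._file.fileno())  # ask the OS to write dirty pages to disk
        self._unflushed = 0
        self._last_flush = time.monotonic()
        self.fsync_count += 1

    def close(self) -> None:
        self.flush()
        self._file.close()
```

At most flush_messages messages, or flush_seconds worth of messages, are unflushed at any moment, which is the upper bound on "at risk" data mentioned above.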
——————————————————————————————————————————————————
"In contrast to the B-tree approach"
A persistent queue can be built the way ordinary logging solutions are, with simple reads and appends to a file.
Although this structure does not support the rich semantics of a B-tree implementation, it has the advantage that all operations are O(1) and reads do not block writes, or vice versa.
This has an obvious performance advantage, because performance is completely decoupled from the size of the data: a server can now take advantage of a number of low-cost SATA drives with more than 1 TB of capacity each. Although these drives have poor seek performance, they perform well for large reads and writes, and they offer 3 times the capacity at 1/3 of the price. Access to virtually unlimited disk space with no performance penalty means that Kafka can provide some features not usually found in a messaging system. For example, in Kafka, messages are not deleted immediately after they are consumed but are retained for a fairly long period (say, a week).
——————————————————————————————————————————————————
The Kafka storage layout is simple. Each partition of a topic corresponds to a logical log. Physically, a log is a set of segment files of approximately the same size. Each time a producer publishes a message to a partition, the broker appends the message to the last segment file. The segment file is flushed to disk only after a configured number of messages have been published or a certain amount of time has elapsed. A message is exposed to consumers only after it has been flushed.
Unlike traditional messaging systems, a message stored in Kafka does not have an explicit message ID.
Instead, each message is addressed by its logical offset in the log. This avoids the overhead of maintaining an auxiliary, seek-intensive random-access index structure that maps message IDs to actual message locations. Message IDs are increasing but not consecutive. To compute the ID of the next message, add the length of the current message to its ID.
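The addressing scheme in this paragraph can be illustrated with a small in-memory sketch. The 4-byte length prefix and the class name OffsetLog are assumptions made for the example, not Kafka's on-disk format; the point is only that a message's ID is its position in the log, so the next ID is the current ID plus the length of the current record.

```python
import struct

# Hypothetical framing for this sketch: a 4-byte big-endian payload length.
HEADER = struct.Struct(">I")


class OffsetLog:
    """In-memory log whose message IDs are byte positions in the log."""

    def __init__(self):
        self._data = bytearray()

    def append(self, payload: bytes) -> int:
        offset = len(self._data)  # the ID of this message is where it starts
        self._data += HEADER.pack(len(payload)) + payload
        return offset

    def read(self, offset: int):
        """Return (payload, ID of the next message)."""
        (size,) = HEADER.unpack_from(self._data, offset)
        start = offset + HEADER.size
        payload = bytes(self._data[start:start + size])
        next_offset = start + size  # current ID + length of the current record
        return payload, next_offset
```

No lookup table is needed: reading a message at a known ID is a direct read at that position, and the IDs increase without being consecutive.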
A consumer always consumes messages from a particular partition sequentially. If the consumer acknowledges a particular message offset, it implies that the consumer has received all messages before that offset in the partition. The consumer issues asynchronous pull requests to the broker so that a buffer of bytes is ready to consume. Each pull request contains the offset of the message from which consumption begins. Kafka uses the sendfile API to deliver bytes efficiently from the broker's log segment files to consumers.
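As a rough illustration of serving a pull request with sendfile, the sketch below sends a byte range of a segment file to a connected socket. The function name send_log_range and its parameters are hypothetical, and the broker itself is written for the JVM, so this only shows the idea. Python's socket.sendfile uses the operating system's sendfile call where it is available and falls back to ordinary send calls otherwise.

```python
import socket


def send_log_range(sock: socket.socket, segment_path: str, offset: int, max_bytes: int) -> int:
    """Send up to max_bytes of a segment file, starting at byte offset, to sock.

    Returns the number of bytes actually sent.
    """
    with open(segment_path, "rb") as segment:
        # The file contents go from the page cache to the socket without being
        # copied into this process when os.sendfile is supported.
        return sock.sendfile(segment, offset=offset, count=max_bytes)
```

The socket must be a connected, blocking stream socket; socket.sendfile rejects non-blocking sockets.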
——————————————————————————————————————————————————
"Design features of Kafka's efficient file storage"
Kafka splits the large file of each partition in a topic into many small segment files. With small segment files, it is easy to periodically purge or delete files that have already been consumed, which reduces disk usage.
The index information makes it possible to locate a message quickly and to determine the maximum size of the response.
Mapping all index metadata into memory avoids disk I/O operations on the segment file.
Storing the index file sparsely greatly reduces the space taken by index metadata.
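To make the memory-mapping point concrete, here is a minimal sketch that writes a small index file and reads entries back through mmap. The entry layout (a 4-byte relative offset followed by a 4-byte file position) and the helper names are assumptions for illustration only. Once mapped, reading an entry is plain memory access served from the page cache, with no explicit read call per lookup.

```python
import mmap
import struct

# Assumed entry layout for this sketch: 4-byte relative offset, 4-byte position.
ENTRY = struct.Struct(">II")


def write_index(path, entries):
    """Write (relative_offset, position) pairs as fixed-width entries."""
    with open(path, "wb") as f:
        for relative_offset, position in entries:
            f.write(ENTRY.pack(relative_offset, position))


def open_index(path):
    """Map the whole index file read-only; the file must not be empty."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def read_entry(index_map, n):
    """Return the n-th (relative_offset, position) pair from a mapped index."""
    return ENTRY.unpack_from(index_map, n * ENTRY.size)
```

Because every entry has the same width, the n-th entry sits at a computable position, which is what makes a binary search over the mapped file possible.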
————————————————————————————————————————————————
Partition: a physical grouping within a topic. A topic can be divided into multiple partitions, and each partition is an ordered queue.
Segment: a partition is physically composed of multiple segments, which are described in detail in sections 2.2 and 2.3 below.
Offset: each partition consists of a series of ordered, immutable messages that are appended to the partition consecutively. Each message in the partition has a sequential number called its offset, which uniquely identifies the message within the partition.
————————————————————————————————————————————————
"Kafka file storage mechanism"
The analysis is divided into the following 4 steps:
How partitions are distributed in storage within a topic
How files are stored in a partition
The storage structure of segment files in a partition
How to find a message by offset in a partition
————————————————————————————————————————————————
How files are stored in a partition
Each partition (a directory) is equivalent to one huge file that is split evenly into multiple segment data files of equal size. However, the number of messages in each segment file is not necessarily equal. This makes it easy to delete old segment files quickly.
Each partition only needs to support sequential reads and writes, and the lifecycle of a segment file is determined by server configuration parameters.
The advantage of this is that useless files can be deleted quickly, which effectively improves disk utilization.
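The cleanup this layout enables can be sketched as a function that removes whole segment files once they are older than a retention period. The function name, the use of file modification time, and the rule of skipping the newest segment are assumptions for the sketch, not a description of the broker's code. In Kafka, the retention period is a broker setting such as log.retention.hours.

```python
import os
import time

SEGMENT_SUFFIXES = (".index", ".log")


def delete_expired_segments(partition_dir, retention_seconds, now=None):
    """Delete segment files last modified more than retention_seconds ago.

    Returns the names of the deleted files in sorted order.
    """
    now = time.time() if now is None else now
    names = sorted(n for n in os.listdir(partition_dir) if n.endswith(SEGMENT_SUFFIXES))
    if not names:
        return []
    # Zero-padded names sort in offset order, so the last one belongs to the
    # segment that is still being appended to.
    active_stem = names[-1].split(".")[0]
    removed = []
    for name in names:
        if name.split(".")[0] == active_stem:
            continue  # never delete the segment that still receives appends
        path = os.path.join(partition_dir, name)
        if now - os.path.getmtime(path) > retention_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```

Deleting a segment is a file removal, so its cost does not depend on how many messages the segment holds.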
————————————————————————————————————————————————
"Segment file storage structure in a partition"
Section 2.2 explained how a partition is stored in the Kafka file system. This section looks at the composition and physical structure of the segment files in a partition.
Segment file composition: a segment consists of two parts, an index file and a data file. The two files correspond one to one and always appear as a pair; the suffixes ".index" and ".log" denote the segment index file and the data file respectively.
Segment file naming rule: the first segment of a partition starts at 0, and each subsequent segment file is named after the offset of the last message in the previous segment file. The value is at most a 64-bit long, which has up to 19 decimal digits, and the file name is left-padded with zeros (to 20 characters in the file names shown in this article).
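The naming rule can be sketched as a small helper. It assumes the 20-character, zero-padded names that appear in this article's examples (such as 00000000000000368769.index); the function name segment_file_names is hypothetical.

```python
def segment_file_names(name_offset: int):
    """Return the (index file, data file) names for a segment.

    name_offset is the number the segment is named after.
    """
    if not 0 <= name_offset < 2**63:
        raise ValueError("offset must fit in a signed 64-bit long")
    stem = f"{name_offset:020d}"  # left-pad with zeros to 20 characters
    return stem + ".index", stem + ".log"
```

Fixed-width padding means that sorting the file names as strings gives the same order as sorting the offsets as numbers.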
The following file list comes from an experiment the author ran on a Kafka broker: create a topic topicxxx with 1 partition, set the segment size to 500 MB, and start a producer that writes a large amount of data to the broker. The segment file list shown in Figure 2 illustrates the two rules above:
————————————————————————————————————————————————
2.4 How to find a message by offset in a partition
For example, reading the message with offset=368776 takes the following 2 steps.
Step 1: find the segment file
In Figure 2 above, 00000000000000000000.index is the first file, and its starting offset is 0. The second file, 00000000000000368769.index, holds messages starting at offset 368770 = 368769 + 1. Similarly, the third file, 00000000000000737337.index, starts at offset 737338 = 737337 + 1. The remaining files are named and sorted by starting offset in the same way, so searching the sorted file list by offset (a binary search works, because the names are ordered) quickly locates the right file.
For offset=368776, the search lands on 00000000000000368769.index|log.
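Step 1 can be sketched with the standard bisect module. The sketch follows the naming rule as this article describes it (a file named N holds messages from offset N + 1, except the first file, which starts at 0); the function name find_segment is hypothetical.

```python
import bisect


def find_segment(segment_names, offset):
    """Return the name (as an int) of the segment that holds offset.

    segment_names is the sorted list of numbers parsed from the file names.
    A file named N holds offsets greater than N, so the wanted file is the
    last one whose name is strictly less than offset.
    """
    idx = bisect.bisect_left(segment_names, offset) - 1
    return segment_names[max(idx, 0)]  # offset 0 belongs to the first file


names = [0, 368769, 737337]  # the three files mentioned in the text
segment = find_segment(names, 368776)
```

With the file list from the text, the lookup for offset 368776 selects the segment named 368769, matching the result stated above.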
Step 2: find the message within the segment file
With the segment file located in step 1, for offset=368776 use the metadata in 00000000000000368769.index to find a physical position in 00000000000000368769.log, then scan 00000000000000368769.log sequentially from that position until the message with offset=368776 is reached.
Figure 3 above shows the advantage of this design: the segment index file uses sparse index storage, which reduces the size of the index file and lets it be operated on directly in memory through mmap. A sparse index keeps a metadata pointer for only some of the messages in the data file, so it saves more storage space than a dense index, but a lookup takes more time.
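Step 2 can be sketched as a binary search over a sparse index followed by a short forward scan of the data file. Everything about the layout here is an assumption made for the example (an 8-byte offset and 4-byte size in front of each payload, an index entry every few messages, in-memory bytes instead of files); it illustrates the technique rather than Kafka's actual format.

```python
import bisect
import struct

# Assumed record framing for this sketch: 8-byte offset, 4-byte payload size.
RECORD_HEADER = struct.Struct(">QI")


def build_segment(messages, first_offset, index_every):
    """Return (log_bytes, sparse_index) for payloads numbered from first_offset."""
    log = bytearray()
    index = []  # sorted (message offset, byte position) pairs, one per index_every messages
    for i, payload in enumerate(messages):
        if i % index_every == 0:
            index.append((first_offset + i, len(log)))
        log += RECORD_HEADER.pack(first_offset + i, len(payload)) + payload
    return bytes(log), index


def find_message(log, index, target):
    """Binary-search the sparse index, then scan the log forward to target."""
    # Largest index entry whose offset is <= target; the infinite second field
    # makes an entry with the same offset sort before the probe.
    i = bisect.bisect_right(index, (target, float("inf"))) - 1
    if i < 0:
        return None  # target is before the first message of this segment
    position = index[i][1]
    while position < len(log):
        offset, size = RECORD_HEADER.unpack_from(log, position)
        body = position + RECORD_HEADER.size
        if offset == target:
            return log[body:body + size]
        if offset > target:
            return None
        position = body + size
    return None
```

The trade-off matches the text: indexing every k-th message shrinks the index by about a factor of k, and in exchange a lookup scans at most k records after the binary search.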
————————————————————————————————————————————————
As Figure 5 above shows, Kafka at runtime rarely performs large numbers of disk reads; it mainly performs periodic batched disk writes, so its disk operations are very efficient.
This is closely tied to how reading and writing messages is designed in Kafka's file storage. Reading and writing messages in Kafka has the following characteristics:
Writing a message
The message is transferred from the Java heap to the page cache (that is, physical memory).
An asynchronous thread flushes the message from the page cache to disk.
Reading a message
The message is sent directly from the page cache to the socket.
When the data is not found in the page cache, disk I/O occurs: the message is loaded from disk into the page cache and then sent directly to the socket.
————————————————————————————————————————————————
Kafka Message File storage