
A review of efficient file reads and writes in Apache Kafka
0. Overview

Kafka's design philosophy says: do not be afraid of the filesystem.

It simply writes ordinary files sequentially and leverages the Linux kernel's Page Cache, rather than managing its own in-memory cache (explicitly, there is no scheme that keeps one copy of the data in memory and another persisted copy at the same time). As long as memory is sufficient and producers and consumers do not diverge too far in speed, reads and writes both land in the Page Cache, with no synchronous disk access at all.

From top to bottom, the whole IO stack consists of the file system layer (VFS + ext3), the Page Cache layer, the generic block layer, the IO scheduling layer, and the block device driver layer. Here we revisit the Page Cache layer and the IO scheduling layer with Apache Kafka in mind, as a plain-language note written against the Linux 2.6 kernel.

1. Page Cache

1.1 Read/write relay in memory

In Linux, the kernel hands any memory not in use by applications over to the Page Cache. Run free, or cat /proc/meminfo, on the command line: the "Cached" figure is the Page Cache.

Inside the Page Cache, each file is tracked by a radix tree whose nodes hold 4 KB pages, so a page can be located quickly from its file offset.
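As a minimal illustration of that lookup key (a sketch of the offset arithmetic only, not the kernel's actual radix tree code):

```c
#include <stdio.h>

#define PAGE_SHIFT 12                    /* 4 KB pages: 2^12 bytes */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void)
{
    unsigned long file_offset = 123456789;

    /* The page index is the key looked up in the per-file radix tree. */
    unsigned long page_index     = file_offset >> PAGE_SHIFT;
    unsigned long offset_in_page = file_offset & (PAGE_SIZE - 1);

    printf("offset %lu -> page index %lu, offset in page %lu\n",
           file_offset, page_index, offset_in_page);
    return 0;
}
```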

When a write occurs, the data is written only into the Page Cache and the page is marked with the dirty flag.

When a read occurs, the Page Cache is searched first. On a hit, the data is returned directly; on a miss, the file is read from disk and the data is written back into the Page Cache.

It follows that as long as producer and consumer speeds do not differ too much, the consumer reads the data the producer has just written directly out of the Page Cache. The whole relay happens in memory, with no disk access at all.

Compared with the traditional approach of also maintaining a copy of the message data in application memory, this does not waste memory on two copies, and the Page Cache needs no GC (so you can safely use 60 GB of memory). Even if Kafka restarts, the Page Cache is still there.

 

1.2 Background asynchronous flush policy

This is the part everyone needs to understand most, because an OS crash (not an application crash) can lose any data that has not been flushed in time; the Page Cache instantly turns from friend into devil.

Of course, Kafka itself does not fear such loss, because its durability is guaranteed by replication: after a restart, a broker pulls the data it missed from the other replicas.

The kernel thread pdflush is responsible for sending pages marked dirty down to the IO scheduling layer. The kernel starts one pdflush thread per disk and wakes it every 5 seconds (/proc/sys/vm/dirty_writeback_centisecs). Its behavior is governed by the following three parameters (a sketch for inspecting them follows this list):

1. If a page has been dirty for more than 30 seconds (/proc/sys/vm/dirty_expire_centisecs, in hundredths of a second), it is flushed to disk; a crash can therefore lose at most about 30 seconds of data.

2. If the total size of dirty pages exceeds 10% (/proc/sys/vm/dirty_background_ratio) of available memory (MemFree + Cached - Mapped in cat /proc/meminfo), the pdflush thread starts writing to disk in the background, without affecting in-flight write(2) calls. Raising or lowering this value is the most important tuning knob in the flush policy.

3. If write(2) outruns pdflush and dirty pages climb to 20% (/proc/sys/vm/dirty_ratio) of total memory (MemTotal in cat /proc/meminfo), all of the application's write calls block and perform the flush within their own time slices: the OS has decided that it cannot keep up with the disk and that a crash at this point would lose too much data, so it forces everyone to calm down. This is expensive and should be avoided as much as possible. Before Redis 2.8, rewriting the AOF often caused exactly this kind of large-scale blocking; Redis has since changed to proactively call flush() every 32 MB.
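To see these knobs on a live system, here is a minimal sketch that prints the tunables named above (it assumes a Linux box with /proc mounted; values are in centiseconds or percent, as noted in the list):

```c
/* Print the pdflush-related tunables discussed above. */
#include <stdio.h>

static void print_tunable(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        perror(path);
        return;
    }
    if (fgets(buf, sizeof buf, f) != NULL)
        printf("%-42s = %s", path, buf);
    fclose(f);
}

int main(void)
{
    print_tunable("/proc/sys/vm/dirty_writeback_centisecs"); /* pdflush wakeup period   */
    print_tunable("/proc/sys/vm/dirty_expire_centisecs");    /* max age of a dirty page */
    print_tunable("/proc/sys/vm/dirty_background_ratio");    /* background flush %      */
    print_tunable("/proc/sys/vm/dirty_ratio");               /* blocking flush %        */
    return 0;
}
```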

For more information, see The Linux Page Cache and pdflush.
 

1.3 Active flush

For important data, the application needs to trigger a flush itself to make sure the data reaches the disk.

1. The fsync() and fdatasync() system calls

fsync(fd) sends write requests for all dirty pages belonging to that file descriptor down to the IO scheduling layer.

fsync() always flushes the file content and the file metadata together, while fdatasync() flushes only the file content plus whatever metadata subsequent operations need. Metadata includes the timestamp, size, and so on; the size may be needed by a later operation, but the timestamp is not. Because file metadata is stored in a different place on disk, fsync() always triggers two IO operations and therefore performs worse.
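A minimal sketch of both calls (the file path is hypothetical, used only for illustration):

```c
/* Write a message, then force it to disk with fdatasync() and fsync(). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "an important message\n";
    int fd = open("/tmp/flush-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, msg, sizeof msg - 1) < 0) { perror("write"); return 1; }

    /* Flush the content plus only the metadata later reads need (e.g. size). */
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    /* Flush content and all metadata (timestamps too): usually two IOs. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```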

2. Set the O_SYNC, O_DSYNC, or O_DIRECT flag when opening the file.

The O_SYNC and O_DSYNC flags mean that every write returns only after the flush completes. The effect is equivalent to following each write() with an fsync() or fdatasync(), but according to the tests in APUE, because the OS optimizes this path it performs better than a hand-rolled write() + fsync(), while still being much slower than a plain write().

The O_DIRECT flag means direct IO: the Page Cache is bypassed completely. This gives up the cache on the read side too, so every read has to hit the disk, and it requires that the length and offset of every IO request be an integer multiple of the underlying sector size. An application using direct IO must therefore maintain its own cache at the application layer.
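A minimal direct-IO sketch under those constraints (the path is hypothetical; 4096 is assumed here as a safe alignment for both the buffer and the request length, since exact requirements vary by kernel and filesystem):

```c
/* Bypass the Page Cache with O_DIRECT; buffer, length, and offset aligned. */
#define _GNU_SOURCE              /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN 4096

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, ALIGN, ALIGN) != 0) { perror("posix_memalign"); return 1; }
    memset(buf, 'k', ALIGN);

    int fd = open("/tmp/direct-demo", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); free(buf); return 1; }

    /* One aligned 4 KB write that skips the Page Cache entirely. */
    if (write(fd, buf, ALIGN) < 0)
        perror("write");         /* EINVAL usually means bad alignment */

    close(fd);
    free(buf);
    return 0;
}
```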
 

1.4 Page Cache cleanup policy

When memory is full, the kernel must either reclaim Page Cache pages or swap application memory out to the swap file. A swappiness parameter (/proc/sys/vm/swappiness), ranging from 0 to 100, decides between swap and Page Cache reclaim. Setting it to 0 means avoid swap, which is what many optimization guides tell you to do, because the default is actually 60 and Linux considers the Page Cache the more important of the two.

The Page Cache cleanup policy is an upgraded LRU. With plain LRU, data that is read only once can still occupy the head of the LRU queue. So the single queue is split in two: one for new pages, and one for pages that have been accessed several times. A page first lands in the new queue and, after a few more accesses, is promoted to the old queue (think of the young generation in a JVM heap). Reclaim evicts from the tails of the LRU queues until enough memory has been freed.
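A toy simulation of that two-queue idea (a sketch of the policy only, with tiny fixed-size queues, not the kernel's real implementation):

```c
/* New pages enter an "inactive" queue; a second access promotes them to
 * an "active" queue; reclaim evicts from the head of the inactive queue. */
#include <stdio.h>
#include <string.h>

#define CAP 4

static int inactive[CAP], n_inactive;
static int active[CAP],   n_active;

static int find(const int *q, int n, int page)
{
    for (int i = 0; i < n; i++)
        if (q[i] == page) return i;
    return -1;
}

static void drop(int *q, int *n, int i)        /* remove element i */
{
    memmove(&q[i], &q[i + 1], (size_t)(*n - i - 1) * sizeof q[0]);
    (*n)--;
}

static void access_page(int page)
{
    int i = find(inactive, n_inactive, page);
    if (i >= 0) {                              /* second access: promote */
        drop(inactive, &n_inactive, i);
        if (n_active == CAP) drop(active, &n_active, 0);
        active[n_active++] = page;
        printf("page %d promoted to active\n", page);
        return;
    }
    if (find(active, n_active, page) >= 0) return;   /* already hot */
    if (n_inactive == CAP) {                   /* reclaim the coldest new page */
        printf("page %d evicted\n", inactive[0]);
        drop(inactive, &n_inactive, 0);
    }
    inactive[n_inactive++] = page;
}

int main(void)
{
    int trace[] = { 1, 2, 3, 1, 4, 5, 6, 7 };  /* page 1 is reused */
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        access_page(trace[i]);
    return 0;
}
```

Run against this trace, pages 2 and 3 (read once) are evicted while page 1 (read twice) stays resident, which is exactly the behavior the split queues are for.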

 

1.5 Read-ahead policy

Given the cleanup policy above, if a consumer in Apache Kafka falls too far behind and tens of GB of content pile up, the cached pages will still be evicted, and the consumer then has to read from disk.

The kernel has a dynamic, adaptive read-ahead policy: each read request tries to read ahead a bit more content (it is, after all, always a read operation). If the kernel sees that a process keeps consuming the read-ahead data, it grows the read-ahead window (minimum 16 KB, maximum 128 KB); otherwise it closes the window. Reading a file sequentially is obviously the perfect fit for read-ahead.
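An application can also tell the kernel explicitly that its access pattern is sequential, via posix_fadvise(). A minimal sketch (the file path is illustrative; any large file works):

```c
/* Hint sequential access so the kernel opens the read-ahead window wide. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/var/log/syslog", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* offset 0, len 0 = the whole file */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise failed: %d\n", rc);

    char buf[64 * 1024];
    while (read(fd, buf, sizeof buf) > 0)
        ;                        /* sequential scan benefits from read-ahead */

    close(fd);
    return 0;
}
```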

 

2. IO scheduling layer

If every read/write request were sent straight to the disk, it would be far too cruel to a traditional hard disk. The IO scheduling layer mainly does two jobs: merging and sorting. Merging combines operations on the same or adjacent sectors (512 bytes each) into one; for example, reads of sectors 1, 2, and 3 can be merged into a single read of sectors 1-3. Sorting arranges all operations in a queue in sector order so the disk head can move in one direction, effectively reducing seeking, the slowest operation of a mechanical disk.
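A toy version of those two jobs on a batch of sector requests (a sketch of the idea, not the kernel's elevator code):

```c
/* Sort requests by start sector, then merge adjacent/overlapping ranges. */
#include <stdio.h>
#include <stdlib.h>

struct req { long start, count; };             /* in 512-byte sectors */

static int by_sector(const void *a, const void *b)
{
    long d = ((const struct req *)a)->start - ((const struct req *)b)->start;
    return (d > 0) - (d < 0);
}

int main(void)
{
    struct req q[] = { {3,1}, {1,1}, {2,1}, {100,4}, {104,2} };
    size_t n = sizeof q / sizeof q[0], m = 0;

    qsort(q, n, sizeof q[0], by_sector);       /* sort */

    for (size_t i = 1; i < n; i++) {           /* merge in place */
        if (q[i].start <= q[m].start + q[m].count) {
            long end = q[i].start + q[i].count;
            if (end > q[m].start + q[m].count)
                q[m].count = end - q[m].start;
        } else {
            q[++m] = q[i];
        }
    }
    for (size_t i = 0; i <= m; i++)            /* prints sectors 1-3, 100-105 */
        printf("read sectors %ld-%ld\n", q[i].start,
               q[i].start + q[i].count - 1);
    return 0;
}
```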

Sorting looks great, but it can cause serious unfairness: if one application keeps writing to adjacent sectors, other applications have to wait. That is fine for pdflush, but read requests are all synchronous, and the waiting readers suffer badly.

All the other scheduling algorithms exist to solve this problem. The default in kernel 2.6 is CFQ (Completely Fair Queuing): the single sorted queue is split into one sorted queue per read/write process, each queue is scheduled with a time slice, and a few requests (4 by default) are taken from each process's queue in turn for execution.

In Apache Kafka, message reads and writes happen in memory, and it is the pdflush kernel thread that writes the disk sequentially. Even when one server holds many Partition files, the system still performs well after merging and sorting: the number of Partition files does not hurt performance, and more files do not turn sequential IO into random IO.

On an SSD there is no seek cost, so sorting seems unnecessary, but merging still helps a great deal; hence the NOOP scheduler, which only merges and does not sort.
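You can check which scheduler a disk is actually using through sysfs; the current one is shown in brackets. A minimal sketch (it assumes a block device named sda; adjust the path for your system):

```c
/* Print the available IO schedulers for a disk, e.g. "noop deadline [cfq]". */
#include <stdio.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");
    if (f == NULL) { perror("open scheduler file"); return 1; }

    if (fgets(line, sizeof line, f) != NULL)
        fputs(line, stdout);

    fclose(f);
    return 0;
}
```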

Aside

On top of all this, the hard disk itself has tens of MB of cache; the gap between the "external transfer rate" (bus to cache) and the "internal transfer rate" (cache to platter) on a disk's spec sheet lives exactly here. The IO scheduling layer may believe the data has been written when it actually has not, and if the power goes down, it is up to the battery or large capacitor on the disk to save the day.

 
