A Tour of Kafka, Design Series: Data Persistence

Source: Internet
Author: User

Don't be afraid of file systems!

Kafka relies heavily on the file system to store and cache messages. The common perception is that disks are slow, which makes many people doubt that a file-system-based architecture can deliver competitive performance. In fact, the speed of a disk depends largely on how it is used. A well-designed on-disk structure can be far faster than commonly assumed, and for some access patterns it can rival memory.
Linear writes on an array of six 7200 RPM SATA disks in RAID-5 reach almost 600 MB/s, whereas random writes reach only about 100 KB/s, a difference of roughly 6,000 times. Modern operating systems optimize heavily for the linear case using read-ahead and write-behind: reads prefetch data in large blocks, and many small logical writes are merged into one large physical write. More detailed studies of this behavior find that sequential disk access can in some cases be faster than random memory access.
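The write-behind idea described above can be illustrated with a small sketch. This is not how any operating system implements it; `WriteBehindBuffer`, its `flush_threshold` parameter, and the 16-byte record size are hypothetical names and values chosen for illustration. The sketch only shows that many small logical writes can reach the underlying device as a handful of large physical writes.

```python
import io


class WriteBehindBuffer:
    """Collects small logical writes and flushes them as one larger physical write.

    Hypothetical illustration only; real operating systems do this in the page cache.
    """

    def __init__(self, sink, flush_threshold=4096):
        self.sink = sink                      # any object with a write(bytes) method
        self.flush_threshold = flush_threshold
        self.pending = bytearray()            # bytes accepted but not yet written out
        self.physical_writes = 0              # number of writes that reached the sink

    def write(self, data: bytes) -> None:
        # A logical write only touches memory until the threshold is reached.
        self.pending += data
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # One physical write carries everything accumulated so far.
        if self.pending:
            self.sink.write(bytes(self.pending))
            self.physical_writes += 1
            self.pending.clear()


sink = io.BytesIO()                           # stands in for the disk
buf = WriteBehindBuffer(sink, flush_threshold=4096)
for _ in range(1000):
    buf.write(b"x" * 16)                      # 1000 logical writes of 16 bytes each
buf.flush()                                   # push out whatever is left
print(buf.physical_writes, len(sink.getvalue()))
```

With these assumed sizes, the 1,000 logical writes are delivered to the sink in only a few physical writes, which is the effect that makes small appends cheap in practice.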
To improve performance, modern operating systems use main memory aggressively as a disk cache and will happily divert all free memory to it, at the cost of a small performance penalty when that memory is reclaimed and reassigned. All disk reads and writes go through this cache, and it cannot easily be bypassed without using direct I/O. So even if a program keeps its own in-process cache of the data, the same data is likely to be held in the operating system's cache as well, which effectively stores everything twice.
In addition, Kafka runs on the JVM, where the following two facts are well known:

    • Java objects have a high memory overhead, often taking twice the space of the data they hold, or more.
    • As the amount of data in the heap grows, garbage collection becomes slower and harder to manage.

Given this, caching the data in process memory is a poor choice. The data ends up stored twice (once in the process and once in the operating system's cache), and because Kafka runs on the JVM, object overhead roughly doubles the space again. Relying on the file system and the operating system's cache instead allows a cache of up to 28-30 GB on a machine with 32 GB of memory, without GC penalties. An in-process cache must also be rebuilt after a restart (about 10 minutes for a 10 GB cache), or else start cold and load data only when it is first used, which makes initial performance very poor. With the file system, nothing has to be reloaded after a restart. Relying on the file system also simplifies the logic for keeping data consistent.

So unlike the traditional design, which caches data in memory and later flushes it to disk, Kafka writes data directly to a persistent log on the file system.

Constant-time operations

In most messaging systems, persistence is implemented by giving each consumer a B-tree or another general-purpose random-access data structure. B-trees are versatile, but they come at a price: B-tree operations are O(log n), and although O(log n) is usually treated as essentially constant, that does not hold for disk operations. A disk seek takes about 10 ms, and each disk can perform only one seek at a time, so parallelism is limited. Although storage systems use caching to optimize heavily, observed tree performance degrades at least linearly as the data grows with a fixed cache: doubling the data makes operations at least twice as slow.
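A rough back-of-the-envelope calculation shows why O(log n) is not "constant" on disk. The sketch below takes the 10 ms seek time from the text and assumes, purely for illustration, a B-tree fanout of 100 and a worst case in which every level of the tree costs one uncached seek. The function names are hypothetical.

```python
SEEK_MS = 10  # per-seek latency quoted in the text


def btree_levels(num_keys: int, fanout: int) -> int:
    """Number of tree levels needed to index num_keys with the given fanout."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


def uncached_lookup_ms(num_keys: int, fanout: int = 100) -> int:
    """Worst-case lookup cost, assuming one disk seek per tree level (no cache hits)."""
    return btree_levels(num_keys, fanout) * SEEK_MS


for n in (1_000_000, 1_000_000_000):
    print(n, "keys:", uncached_lookup_ms(n), "ms per lookup")
```

Under these assumptions a single lookup costs tens of milliseconds, and because a disk performs one seek at a time, one disk can serve only a few dozen such lookups per second.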
Intuitively, a messaging system used mainly for log processing can persist data simply by appending to a file and reading from it. The benefit is that both reads and writes are O(1), and reads do not block writes or other reads. The performance advantage is clear: performance no longer depends on the size of the data.
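The append-and-read pattern can be sketched as a toy log file. This is not Kafka's actual on-disk format; the class name `AppendOnlyLog` and the record layout (a 4-byte length prefix followed by the payload) are assumptions made for illustration. The point is that an append always goes to the end of the file, and a read needs only a known offset, so neither depends on how much data the file already holds.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Toy log: each record is a 4-byte big-endian length followed by the payload."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()              # create the file if it does not exist

    def append(self, payload: bytes) -> int:
        """Append one record and return the byte offset at which it starts."""
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)   # appends always land at the end
            f.write(struct.pack(">I", len(payload)) + payload)
        return offset

    def read(self, offset: int):
        """Read the record at `offset`; return (payload, offset of the next record)."""
        with open(self.path, "rb") as f:
            f.seek(offset)                    # one positioning step, no tree walk
            (length,) = struct.unpack(">I", f.read(4))
            payload = f.read(length)
        return payload, offset + 4 + length


path = os.path.join(tempfile.mkdtemp(), "demo.log")
log = AppendOnlyLog(path)
first = log.append(b"hello")
second = log.append(b"kafka")
payload, next_offset = log.read(first)
print(payload, next_offset == second)
```

A reader that remembers `next_offset` can keep scanning forward record by record, which is the sequential access pattern the earlier sections argue disks handle well.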
Because a messaging system built this way has access to disk space that is practically unlimited (relative to memory), it can offer features that ordinary messaging systems lack, without a performance penalty. For example, an ordinary messaging system deletes a message as soon as it is consumed, whereas Kafka can retain messages for a period of time (for example, a week). This gives consumers a great deal of flexibility, as later articles in this series describe in detail.
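Time-based retention can be sketched as follows. This is a simplified in-memory model, not Kafka's implementation: `RetainedLog`, `expire`, and the explicit `now` arguments are hypothetical, and the one-week window is the example period from the text. Messages are dropped because of their age, never because a consumer has read them, so a consumer can re-read any offset that is still retained.

```python
from collections import deque

RETENTION_SECONDS = 7 * 24 * 3600   # "a week", the example period from the text


class RetainedLog:
    """Keeps messages for a fixed time window, independent of consumption."""

    def __init__(self):
        self.base_offset = 0         # offset of the oldest message still kept
        self.messages = deque()      # (timestamp, payload) pairs in arrival order

    def append(self, payload, now):
        self.messages.append((now, payload))
        return self.base_offset + len(self.messages) - 1

    def expire(self, now):
        """Drop messages older than the retention window, read or not."""
        while self.messages and now - self.messages[0][0] > RETENTION_SECONDS:
            self.messages.popleft()
            self.base_offset += 1

    def read(self, offset):
        index = offset - self.base_offset
        if index < 0 or index >= len(self.messages):
            raise IndexError("offset is not retained")
        return self.messages[index][1]


DAY = 24 * 3600
log = RetainedLog()
log.append("a", now=0)
log.append("b", now=3 * DAY)
log.read(0)                  # a consumer reads "a" ...
log.read(0)                  # ... and can rewind and read it again later
log.expire(now=8 * DAY)      # "a" is now 8 days old, "b" is 5 days old
print(log.base_offset, log.read(1))
```

After the expiry pass only the message that is still inside the window remains, and its offset is unchanged, so consumers that track their own position are unaffected by the cleanup.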
