A Tour of Kafka Design: Data Persistence

Last Update:2015-12-17 Source: Internet

Author: User

Don't be afraid of file systems!

Kafka relies heavily on the file system to store and cache messages. The common perception is that disks are slow, which makes many people doubt that a file-system-based architecture can offer competitive performance. In fact, the speed of a disk depends heavily on how it is used, and a well-designed disk-based architecture can come surprisingly close to memory in speed.
The linear write throughput of an array of six 7200-RPM SATA disks in RAID-5 is almost 600 MB/s, but its random write throughput is only about 100 KB/s, a difference of roughly 6,000 times. Modern operating systems optimize heavily for the linear case with read-ahead and write-behind: reads prefetch data in large blocks, and many small logical writes are merged into one large physical write. An in-depth discussion of this topic found that sequential disk access can in some cases be faster than random memory access.
To improve performance, modern operating systems use main memory as a disk cache and will happily devote all free memory to it, at the cost of a small performance penalty when that memory has to be reclaimed. All disk reads and writes pass through this cache, which cannot easily be bypassed without using direct I/O. So even if a program keeps its own in-process cache of the data, a second copy is likely to sit in the operating system's cache, which amounts to storing the data twice.
Furthermore, Kafka runs on the JVM, and the following two facts about the JVM are well known:
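The size of that gap is easy to check with a little arithmetic. The sketch below uses only the two throughput figures quoted above; the decimal unit conversion (1 MB = 1000 KB) and the function names are assumptions made for illustration.

```python
# Throughput figures quoted in the text, expressed in KB/s.
# Assumption: decimal units, so 600 MB/s = 600,000 KB/s.
LINEAR_WRITE_KB_PER_S = 600 * 1000
RANDOM_WRITE_KB_PER_S = 100


def slowdown(linear_kb_s, random_kb_s):
    """How many times slower random writes are than linear writes."""
    return linear_kb_s / random_kb_s


def seconds_to_write(size_kb, rate_kb_s):
    """Time needed to write size_kb kilobytes at a steady rate."""
    return size_kb / rate_kb_s


ONE_GB_IN_KB = 1000 * 1000

print(slowdown(LINEAR_WRITE_KB_PER_S, RANDOM_WRITE_KB_PER_S))  # 6000.0
print(seconds_to_write(ONE_GB_IN_KB, RANDOM_WRITE_KB_PER_S))   # 10000.0
```

At these rates, writing 1 GB takes under two seconds linearly and close to three hours with random writes, which is why the access pattern matters more than the medium.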

Java objects have a very high memory overhead, often doubling the size of the data to be stored, or worse.
As the amount of data in the heap grows, garbage collection becomes increasingly slow and hard to manage.

Based on this analysis, caching the data inside the process is expensive: the data is held twice (once in the process and once in the operating system's cache), and because Kafka runs on the JVM, object overhead can double the space again. Relying on the file system and the operating system's cache instead avoids the GC penalty and lets a machine with 32 GB of memory use 28-30 GB of it as cache. In addition, an in-process cache has to be rebuilt in memory after a restart (about 10 minutes for a 10 GB cache), and the alternative of starting cold (loading data into memory only when it is first requested) makes initial performance very poor. With the file system, nothing has to be reloaded even if the service restarts. Using the file system also simplifies the logic for keeping data consistent.

So, unlike the traditional design that caches data in memory and later flushes it to disk, Kafka writes data directly to a log on the file system.

Constant-time operations

In most messaging systems, persistence is implemented by giving each consumer a B-tree or another general-purpose random-access data structure. B-trees are very versatile, but they come at a price: B-tree operations are O(log n). O(log n) is often treated as effectively constant, but that does not hold for disk operations. A disk seek takes about 10 ms, and each disk can perform only one seek at a time, so parallelism is limited. Although storage systems optimize heavily with caching, observed tree performance degrades as the data grows: doubling the data makes operations at least twice as slow.
Intuitively, for a messaging system that is used mainly for log processing, persistence can be built on simply appending data to a file and reading it back from the file. The benefit is that both reads and writes are O(1), and reads do not block writes or each other. The performance advantage is obvious, because performance is decoupled from the size of the data.
Because disk space is, relative to memory, almost unlimited, a messaging system built this way can offer features that ordinary messaging systems lack, without a performance penalty. For example, an ordinary messaging system deletes a message as soon as it has been consumed, whereas Kafka can keep messages for a relatively long period (for example, a week). This gives consumers a great deal of flexibility, as later articles will describe in detail.
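To see why O(log n) is not "constant" on disk, it helps to put the 10 ms seek time into a rough cost model. The sketch below assumes the worst case where no tree level is cached, so each level costs one seek; the fanout of 100 and the function names are hypothetical values chosen for illustration, not measurements of any real system.

```python
SEEK_MS = 10  # seek time quoted in the text


def btree_height(num_keys, fanout):
    """Smallest number of levels h such that fanout**h >= num_keys."""
    height, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        height += 1
    return height


def uncached_lookup_ms(num_keys, fanout, seek_ms=SEEK_MS):
    """Lookup cost if every level of the tree requires one disk seek."""
    return btree_height(num_keys, fanout) * seek_ms


def max_lookups_per_second(num_keys, fanout, seek_ms=SEEK_MS):
    """Upper bound for one disk, which can serve only one seek at a time."""
    return 1000 / uncached_lookup_ms(num_keys, fanout, seek_ms)


print(uncached_lookup_ms(1_000_000, fanout=100))  # 30
```

Under these assumptions a single disk serves only a few dozen uncached lookups per second, regardless of how fast the CPU is. Caching the upper levels improves this, but only while the cache keeps up with the growth of the data.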
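The append-and-read idea can be sketched in a few lines. This is a minimal illustration, not Kafka's actual on-disk format: the class name, the 4-byte length prefix, and the use of byte positions as offsets are all assumptions of the sketch.

```python
import os
import struct
import tempfile

# Assumption of this sketch: each record is a 4-byte big-endian length
# followed by the payload bytes.
_LEN = struct.Struct(">I")


class AppendOnlyLog:
    """Hypothetical append-only message log backed by a single file."""

    def __init__(self, path):
        self.path = path
        # "ab" mode: every write goes to the end of the file.
        self._writer = open(path, "ab")
        self._end = os.path.getsize(path)

    def append(self, payload):
        """Append one record and return the byte offset where it starts."""
        offset = self._end
        self._writer.write(_LEN.pack(len(payload)) + payload)
        self._writer.flush()  # hand the bytes to the OS cache
        self._end += _LEN.size + len(payload)
        return offset

    def read(self, offset):
        """Return (payload, next_offset) for the record at offset."""
        # A separate read handle, so reading never touches the writer.
        with open(self.path, "rb") as reader:
            reader.seek(offset)
            (length,) = _LEN.unpack(reader.read(_LEN.size))
            return reader.read(length), offset + _LEN.size + length

    def close(self):
        self._writer.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = AppendOnlyLog(os.path.join(tmp, "demo.log"))
        first = log.append(b"hello")
        log.append(b"world")
        payload, next_offset = log.read(first)
        print(payload, log.read(next_offset)[0])
        log.close()
```

Neither operation depends on how much data the file already holds: an append writes at the end, and a read is one seek to a known position. A consumer only has to remember the offset of the next record it wants.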
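Time-based retention of this kind can be sketched as a simple filter over log files. The example assumes the log is split into segment files that each record when they were last written; the Segment type, its fields, and the function name are hypothetical and only illustrate the idea of deleting by age rather than by consumption.

```python
from dataclasses import dataclass

WEEK_SECONDS = 7 * 24 * 60 * 60  # the one-week example from the text


@dataclass(frozen=True)
class Segment:
    """Hypothetical log segment: a file name and its last write time."""
    name: str
    last_write_ts: float  # seconds since the epoch


def expired_segments(segments, now, retention_seconds=WEEK_SECONDS):
    """Segments whose newest data is older than the retention period.

    Whether a message has been consumed plays no role in the decision.
    """
    return [s for s in segments if now - s.last_write_ts > retention_seconds]


day = 24 * 60 * 60
segments = [Segment("old.log", 0.0), Segment("recent.log", 9.0 * day)]
print([s.name for s in expired_segments(segments, now=10.0 * day)])
# ['old.log']
```

Because deletion depends only on age, a consumer can go back and re-read any message that is still inside the retention window.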

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
