A Tour of Kafka Design: Data Persistence

Last Update: 2014-07-10 Source: Internet

Author: User

Reprinted with the source: http://blog.csdn.net/honglei915/article/details/37564595

Do not fear file systems!

Kafka relies heavily on the file system to store and cache messages. The common perception that "disks are slow" makes many people doubt that a file-system-based architecture can deliver competitive performance. In fact, disks are both much slower and much faster than people expect, depending on how they are used. A well-designed disk structure can often be as fast as the network.
The key fact is that sequential and random disk access perform very differently. Linear writes on a RAID-5 array of six 7200 rpm SATA disks reach about 600 MB/s, while random writes manage only about 100 KB/s, a difference of roughly 6000 times. Modern operating systems heavily optimize linear reads and writes: read-ahead prefetches data in large blocks, and write-behind groups many small logical writes into one large physical write. A more in-depth look at this topic shows that sequential disk access can in some cases be faster than random memory access.
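
The write-behind idea can be illustrated in user space. The following is a minimal sketch, not how the kernel implements it: Python's io.BufferedWriter stands in for the operating system's write-behind buffer, and the hypothetical CountingDevice class counts how many "physical" writes reach the device. The record size, record count, and buffer size are arbitrary assumptions.

```python
import io


class CountingDevice(io.RawIOBase):
    """Hypothetical in-memory device that counts the writes it receives."""

    def __init__(self):
        super().__init__()
        self.physical_writes = 0
        self.stored = bytearray()

    def writable(self):
        return True

    def write(self, data):
        self.physical_writes += 1
        self.stored += bytes(data)  # data may arrive as a memoryview
        return len(data)


def write_records(record, count, buffered, buffer_size=64 * 1024):
    """Write `count` copies of `record`; return (physical writes, bytes stored)."""
    device = CountingDevice()
    if buffered:
        # Small logical writes accumulate in memory and reach the device
        # only when the buffer fills up or is flushed.
        writer = io.BufferedWriter(device, buffer_size=buffer_size)
        for _ in range(count):
            writer.write(record)
        writer.flush()
    else:
        # Every logical write becomes its own physical write.
        for _ in range(count):
            device.write(record)
    return device.physical_writes, len(device.stored)


if __name__ == "__main__":
    record = b"x" * 100
    print("unbuffered:", write_records(record, 1000, buffered=False))
    print("buffered:  ", write_records(record, 1000, buffered=True))
```

Both runs store the same 100,000 bytes, but the buffered run reaches the device in a handful of large writes instead of 1,000 small ones.
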
To compensate for slow disks, modern operating systems use main memory aggressively as a disk cache. A modern operating system will happily divert all free memory to disk caching, at the cost of a small performance penalty when that memory is reclaimed. All disk reads and writes go through this cache, and it cannot easily be bypassed without using direct I/O. As a result, even if a process keeps its own in-process cache of the data, the same data is likely duplicated in the operating system's page cache, so everything is effectively stored twice.
In addition, two facts about the JVM are well known:

Java objects carry a high memory overhead, often doubling the size of the stored data or worse.
As the amount of data in the heap grows, garbage collection becomes increasingly slow and difficult to tune.

Given these factors, using the file system and relying on the operating system's page cache is better than maintaining an in-process cache. An in-process cache stores the data twice (once in the process and once in the page cache), and holding it as Java objects roughly doubles the space again. By relying on the page cache and storing messages as compact bytes instead, a machine with 32 GB of memory can use up to 28-30 GB as cache without any GC penalty. Furthermore, the page cache stays warm when the service restarts, whereas an in-process cache must either be rebuilt in memory (about 10 minutes for a 10 GB cache) or start completely cold and fill lazily as data is accessed, which means very slow initial performance. Relying on the file system also simplifies the code, because the logic for keeping the cache and the file system consistent lives in the operating system.

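Therefore, unlike the traditional design of keeping as much data as possible in memory and flushing it to disk only when necessary, Kafka writes all data to a persistent log on the file system immediately, without necessarily forcing a flush to disk. In effect, the data is simply transferred into the kernel's page cache.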
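
As a rough illustration of writing straight to a file-system log, the sketch below appends messages to a file and lets the operating system decide when to write them to the device. The names append_message and read_messages, the newline framing, and the durable flag are assumptions made for this example; they are not Kafka's API or on-disk format.

```python
import os
import tempfile


def append_message(log_file, payload, durable=False):
    """Append one newline-terminated message to an open binary log file."""
    log_file.write(payload + b"\n")
    # flush() moves the bytes from Python's buffer into the OS page cache,
    # where other readers of the file can already see them.
    log_file.flush()
    if durable:
        # fsync() additionally asks the OS to push the data to the device.
        os.fsync(log_file.fileno())


def read_messages(path):
    """Read every message back through a separate file handle."""
    with open(path, "rb") as reader:
        return reader.read().splitlines()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "messages.log")
        with open(path, "ab") as log_file:
            append_message(log_file, b"hello")
            append_message(log_file, b"world", durable=True)
            print(read_messages(path))
```

The first message is readable as soon as it has been flushed to the page cache; the durable flag only controls whether the writer also waits for the device.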

Constant-time operation efficiency

In most messaging systems, the persistent data structure is a per-consumer queue backed by a B-tree or another general-purpose random-access structure. B-trees are versatile, but they come at a cost: B-tree operations are O(log n), and although O(log n) is normally treated as essentially constant time, this does not hold for disk operations. A disk seek takes about 10 ms, and each disk can perform only one seek at a time, so parallelism is limited. Even a handful of seeks is therefore expensive. Because storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures with a fixed cache often degrades faster than linearly as the data grows: doubling the data makes operations much worse than twice as slow.
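
The cost of seeks can be made concrete with a little arithmetic. The sketch below is a deliberately simplified model: it takes the 10 ms seek time from the text and assumes, hypothetically, a B-tree fanout of 100, one seek per uncached tree level, and that only the root level is cached.

```python
def btree_height(n_keys, fanout):
    """Number of levels needed for a B-tree with `fanout` to index `n_keys`."""
    height, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout
        height += 1
    return height


def lookups_per_second(n_keys, fanout=100, cached_levels=1, seek_ms=10.0):
    """Upper bound on lookups/s for one disk that performs one seek at a time."""
    seeks = max(btree_height(n_keys, fanout) - cached_levels, 0)
    if seeks == 0:
        return float("inf")  # everything is served from cache
    return 1000.0 / (seeks * seek_ms)


if __name__ == "__main__":
    for n_keys in (1_000_000, 100_000_000):
        print(n_keys, round(lookups_per_second(n_keys), 1))
```

Under these assumptions a single disk serves at most 50 lookups per second for one million keys and about 33 for one hundred million, no matter how fast it can stream data sequentially.
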
Intuitively, for a messaging system used mainly for log processing, persistence can be built on simple appends to a file and sequential reads from it. The advantage is that all operations are O(1), and reads do not block writes or each other. The performance benefit is clear: performance is completely decoupled from data size.
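
A minimal sketch of such an append-only log follows. It is an illustration under stated assumptions, not Kafka's implementation: the class name AppendOnlyLog, the 4-byte length prefix, and the use of a byte position as the message offset are all hypothetical choices made for this example.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Hypothetical minimal log: each record is a 4-byte length plus payload."""

    _HEADER = struct.Struct(">I")  # big-endian unsigned 32-bit length

    def __init__(self, path):
        self._path = path
        self._writer = open(path, "ab")
        self._end = os.path.getsize(path)  # byte position of the next record

    def append(self, payload):
        """Append a record and return the byte offset where it starts."""
        offset = self._end
        self._writer.write(self._HEADER.pack(len(payload)) + payload)
        self._writer.flush()
        self._end += self._HEADER.size + len(payload)
        return offset

    def read(self, offset):
        """Return (payload, offset of the next record) with a single seek."""
        with open(self._path, "rb") as reader:
            reader.seek(offset)
            (length,) = self._HEADER.unpack(reader.read(self._HEADER.size))
            return reader.read(length), offset + self._HEADER.size + length

    def close(self):
        self._writer.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = AppendOnlyLog(os.path.join(tmp, "topic.log"))
        first = log.append(b"first")
        second = log.append(b"second")
        print(first, second, log.read(second))
        log.close()
```

Appending never touches existing data, and reading a record costs one seek plus a sequential read whatever the size of the file. Readers use their own file handle, so they do not contend with the writer.
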
Because disk space is practically unlimited compared with memory, a messaging system built this way can offer features that typical messaging systems lack, with no performance penalty. For example, a typical messaging system deletes a message as soon as it is consumed, whereas Kafka can retain messages for a relatively long period (for example, a week). This gives consumers a great deal of flexibility, as later articles will describe in detail.
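
Time-based retention can be sketched as follows. This is a simplified illustration, not Kafka's code: the Segment class, its fields, and the split_by_retention function are hypothetical, and the one-week default simply mirrors the example in the text.

```python
from dataclasses import dataclass

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60


@dataclass(frozen=True)
class Segment:
    """Hypothetical log segment: a file of messages plus its last write time."""
    name: str
    last_modified: float  # seconds since the epoch


def split_by_retention(segments, now, retention_seconds=ONE_WEEK_SECONDS):
    """Return (kept, expired); consumption state plays no part in the decision."""
    kept, expired = [], []
    for segment in segments:
        age = now - segment.last_modified
        (expired if age > retention_seconds else kept).append(segment)
    return kept, expired


if __name__ == "__main__":
    now = 1_000_000_000.0
    day = 24 * 60 * 60
    segments = [
        Segment("segment-0", now - 10 * day),
        Segment("segment-1", now - 3 * day),
        Segment("segment-2", now),
    ]
    kept, expired = split_by_retention(segments, now)
    print([s.name for s in kept], [s.name for s in expired])
```

Because deletion depends only on age, a consumer can re-read any message that is still inside the retention window, whether or not it has been consumed before.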

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
