Bcache: using an SSD as a cache to accelerate Linux block devices

Reprinted; original source: http://blog.csdn.net/liumangxiong

Bcache is a cache in the Linux kernel block layer. It uses an SSD (or a similarly fast device) as a cache for HDDs to speed them up: HDDs are cheap and large, SSDs are fast but expensive, and bcache lets you get the best of both. Bcache uses an SSD as a cache for other block devices. It is similar to the L2ARC in ZFS, but bcache also adds a write-back policy that is independent of the file system. Bcache is designed to work in any environment with minimal configuration. By default, bcache does not cache sequential IO, only random reads and writes. It is suitable for desktops, servers, high-end storage arrays, and even embedded environments.

The design goal of bcache is to make the cached device as fast as the SSD itself (for cache hits, cache misses, write-through and write-back alike). It has not fully reached that goal yet, especially for sequential writes, but test results show it comes very close, and in some cases, such as random writes, it even does better.

Bcache takes data safety seriously. For a write-back cache, reliability is critical; an error means data loss. Bcache is intended as an alternative to battery-backed RAID controllers, so it must also keep data safe across an unexpected power loss: a write is reported as successful to the upper layer only after all of the data has been written to reliable media, and after a power failure a write can never be left partially completed. A great deal of work has gone into this part of data safety.

Bcache is designed to perform on par with the SSD: it minimizes write amplification as far as possible and avoids random writes by converting them into sequential writes. Writes first go to the SSD cache, which absorbs a large volume of writes, and the data is finally written back to the disk or array sequentially. For a RAID6 array, random write performance is poor, and array controllers with battery protection are expensive. With bcache you can use the excellent software RAID built into Linux instead, and even get better random write performance on cheaper hardware.
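As a quick orientation before the feature list, the following is a minimal setup sketch based on the upstream bcache documentation (it is not part of the original article); /dev/sdb, /dev/sdc and the cache-set UUID are placeholders, and bcache-tools must be installed:

# make-bcache -B /dev/sdb                      # format /dev/sdb as a backing device
# make-bcache -C /dev/sdc                      # format /dev/sdc as a cache device
# echo /dev/sdb > /sys/fs/bcache/register      # register the devices (udev normally does this)
# echo /dev/sdc > /sys/fs/bcache/register
# ls /sys/fs/bcache/                           # note the cache set UUID that appears here
# echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # attach the cache set to the backing device
# mkfs.ext4 /dev/bcache0                       # use /dev/bcache0 like any other block device

A single make-bcache invocation can also format the backing and cache devices together (as shown later in this article), in which case the attach step is not needed.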
Features

1. A single cache device can be used as the cache for multiple devices, and caches can be added or removed dynamically while the devices are running.
2. Recovery from unclean shutdown: writes are not reported complete until the cache is consistent with respect to the backing disk.
3. Write barriers and cache flushes are handled correctly.
4. Writethrough, writeback, and writearound modes are supported.
5. Sequential IO is detected and bypassed (this can be disabled; see the sysfs sketch after this list).
6. When SSD latency is detected to exceed a configured threshold, traffic to the SSD is throttled (useful when a single SSD caches multiple disks).
7. Readahead on a cache miss (disabled by default).
8. A high-performance writeback implementation: dirty data is sorted before being written back. When a writeback watermark is set, a PD controller smoothly throttles the background writeback traffic according to the proportion of dirty data.
9. An efficient B+ tree is used; bcache random reads can reach 1M IOPS.
10. Stable and already used in production.
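The cache mode and the tunables mentioned above (sequential bypass, congestion thresholds, writeback watermark) are exposed through sysfs. The following is a minimal sketch assuming a bcache device named bcache0 and a cache set registered under /sys/fs/bcache/<cset-uuid>; the attribute names follow the upstream bcache documentation and may vary slightly between kernel versions:

# echo writeback > /sys/block/bcache0/bcache/cache_mode             # writethrough / writeback / writearound / none
# echo 40 > /sys/block/bcache0/bcache/writeback_percent             # writeback watermark used by the PD controller
# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff              # 0 disables the sequential IO bypass
# echo 0 > /sys/fs/bcache/<cset-uuid>/congested_read_threshold_us   # 0 disables the read congestion threshold
# echo 0 > /sys/fs/bcache/<cset-uuid>/congested_write_threshold_us  # 0 disables the write congestion threshold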

Performance
Random write test, 7/25/12. On my test machine I split the SSD into two partitions of equal size: one partition is used to test the raw SSD, and the other is used as the cache for the hard disk. The bcache configuration was changed as follows: cache_mode set to writeback and writeback_percent set to 40. (When writeback_percent is non-zero, bcache uses a PD controller to smoothly throttle the traffic sent to the disk according to the number of dirty blocks in the cache.) The congestion thresholds were also disabled, because if bcache switched to writethrough when the SSD latency reached the limit, the results would be skewed. The SSD is an Intel 160 GB MLC SSD, i.e. an Intel SSDSA2M160. fio was used for the performance test, with the following job file:

[global]
randrepeat=1
ioengine=libaio
bs=4k
ba=4k
size=8G
direct=1
gtod_reduce=1
norandommap
iodepth=64

fio ran directly on the raw block devices without a file system, but for this kind of benchmark that should not matter. The random write results on the raw SSD device are as follows:
root@utumno:~# fio ~/rw4k
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio 1.59
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/49885K /s] [0 /12.2K iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=1770
  write: io=8192.3MB, bw=47666KB/s, iops=11916 , runt=175991msec
  cpu          : usr=4.33%, sys=14.28%, ctx=2071968, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/2097215/0, short=0/0/0
Run status group 0 (all jobs):
  WRITE: io=8192.3MB, aggrb=47666KB/s, minb=48810KB/s, maxb=48810KB/s, mint=175991msec, maxt=175991msec
Disk stats (read/write):
  sdb: ios=69/2097888, merge=0/3569, ticks=0/11243992, in_queue=11245600, util=99.99%
The same test with bcache added:
root@utumno:~# fio ~/rw4k
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio 1.59
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/75776K /s] [0 /18.5K iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=1914
  write: io=8192.3MB, bw=83069KB/s, iops=20767 , runt=100987msec
  cpu          : usr=3.17%, sys=13.27%, ctx=456026, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/2097215/0, short=0/0/0
Run status group 0 (all jobs):
  WRITE: io=8192.3MB, aggrb=83068KB/s, minb=85062KB/s, maxb=85062KB/s, mint=100987msec, maxt=100987msec
Disk stats (read/write):
  bcache0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
With bcache added, the test reached 18.5K IOPS, versus 12.2K IOPS on the raw SSD. Bcache performs better because it sends writes to the SSD sequentially, at the cost of updating its index. Bcache optimizes for random writes, and it also benefits from the high IO depth (64), because at a high IO depth multiple index updates can be merged into a single write. A high IO depth represents a heavily loaded system; when the IO depth is lowered, the IOPS changes as well:

IO depth of 32: bcache 20.3K IOPS, raw SSD 19.8K IOPS

IO depth of 16: bcache 16.7K IOPS, raw SSD 23.5K IOPS

IO depth of 8: bcache 8.7K IOPS, raw SSD 14.9K IOPS

IO depth of 4: bcache 8.9K IOPS, raw SSD 19.7K IOPS

SSD performance fluctuates at different IO depths, and different write patterns give different results; here we only care about the relative values for the two setups.
For random 4K writes at an IO depth of 1, bcache issues twice as many writes as the raw SSD device, because the index must be updated for every write.

Random Read

IO depth of 64: bcache 29.5K IOPS, raw SSD 25.4K IOPS

IO depth of 16: bcache 28.2K IOPS, raw SSD 27.6K IOPS

Bcache is slightly faster here, which is probably related to the layout of the data being read. The conclusion is that bcache's random read performance is on par with the raw SSD. Note that this test is somewhat unfair to bcache: because the data being read was written with 4K random writes, every extent in the btree is 4K, which makes the btree much larger than normal; in real-world use the average extent size is around 100K. A larger btree means the index occupies more memory, and part of it lives in secondary index nodes. In my experience these overheads only start to have a real impact on large machines doing very high IOPS. If you have other ways of testing, or see a problem with my methodology, please let me know by email.
If the cache is dirty when the machine is shut down or when the cache device is removed from the system, the data on the backing disk alone is not consistent. To make the backing disk safe to use on its own, you must either detach the cache manually or switch the cache to writethrough mode and let the dirty data flush first (a sysfs sketch of this appears at the end of this section).

Automatic assembly: bcache matches cache and backing devices automatically, and the matching does not depend on the order in which the devices become available to the system. To use bcache for the root partition, add rootdelay=3 to the kernel boot parameters so that the udev rules can run before the system mounts the root file system.

If a partition or disk does not come up as a bcache device at boot, the cause may be a stale superblock left over from a previous file system. To detect bcache devices correctly, the udev rules rely on blkid: they examine the device's superblock to identify the file system type, and if the superblock does not identify the device as bcache, the device is not registered.
# cat /usr/lib/udev/rules.d/61-bcache.rules
...
# Backing devices: scan, symlink, register
IMPORT{program}="/sbin/blkid -o udev $tempnode"
# blkid and probe-bcache can disagree, in which case don't register
ENV{ID_FS_TYPE}=="?*", ENV{ID_FS_TYPE}!="bcache", GOTO="bcache_backing_end"
...

# lsblk -o NAME,MAJ:MIN,RM,SIZE,TYPE,FSTYPE,MOUNTPOINT,UUID,PARTUUID
NAME        MAJ:MIN RM   SIZE TYPE FSTYPE MOUNTPOINT UUID                                 PARTUUID
sda           8:0    0 111.8G disk
├─sda1        8:1    0     3G part vfat   /esp       7E67-C0BB                            d39828e8-4880-4c85-9ec0-4255777aa35b
└─sda2        8:2    0 108.8G part ext2              93d22899-cd86-4815-b6d0-d72006201e75 baf812f4-9b80-42c4-b7ac-5ed0ed19be65
sdb           8:16   0 931.5G disk
└─sdb1        8:17   0 931.5G part ntfs              FAD2B75FD2B71EB7                     90c80e9d-f31a-41b4-9d4d-9b02029402b2
sdc           8:32   0   2.7T disk bcache            4bd63488-e1d7-4858-8c70-a35a5ba2c452
└─bcache1   254:1    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdd           8:48   0   2.7T disk bcache            ce6de517-7538-45d6-b8c4-8546f13f76c1
└─bcache0   254:0    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sde           8:64   1  14.9G disk
└─sde1        8:65   1  14.9G part ext4   /          d07321b2-b67d-4daf-8022-f3307b605430 5d0a4d76-115f-4081-91ed-fb09aa2318d
In the example above, an ext2 file system existed on the partition before bcache was set up. The bcache devices were created with the following command:
# make-bcache -B /dev/sdc /dev/sdd -C /dev/sda2
Because /dev/sdc and /dev/sdd are identified as bcache devices, they are registered automatically at boot, but /dev/sda2 has to be added manually. The superblock of the previous file system is still present at offset 1024 of /dev/sda2, while bcache records its information starting at offset 4096. The fix is to overwrite the old superblock:
# dd if=/dev/zero count=1 bs=1024 seek=1 of=/dev/sda2
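Until the stale superblock has been wiped, or as a one-off workaround, the device can also be registered by hand for the current boot. This is a sketch based on the upstream bcache documentation, not a step from the original article:

# echo /dev/sda2 > /sys/fs/bcache/register     # register the cache device manually, bypassing the udev check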
After the system is restarted, all disks are correctly identified:
# lsblk -o NAME,MAJ:MIN,RM,SIZE,TYPE,FSTYPE,MOUNTPOINT,UUID,PARTUUID
NAME        MAJ:MIN RM   SIZE TYPE FSTYPE MOUNTPOINT UUID                                 PARTUUID
sda           8:0    0 111.8G disk
├─sda1        8:1    0     3G part vfat   /esp       7E67-C0BB                            d39828e8-4880-4c85-9ec0-4255777aa35b
└─sda2        8:2    0 108.8G part bcache            93d22899-cd86-4815-b6d0-d72006201e75 baf812f4-9b80-42c4-b7ac-5ed0ed19be65
  ├─bcache0 254:0    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
  └─bcache1 254:1    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdb           8:16   0 931.5G disk
└─sdb1        8:17   0 931.5G part ntfs              FAD2B75FD2B71EB7                     90c80e9d-f31a-41b4-9d4d-9b02029402b2
sdc           8:32   0   2.7T disk bcache            4bd63488-e1d7-4858-8c70-a35a5ba2c452
└─bcache1   254:1    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdd           8:48   0   2.7T disk bcache            ce6de517-7538-45d6-b8c4-8546f13f76c1
└─bcache0   254:0    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sde           8:64   1  14.9G disk
└─sde1        8:65   1  14.9G part ext4   /          d07321b2-b67d-4daf-8022-f3307b605430 5d0a4d76-115f-4081-91ed-fb09aa2318dd
In the same way, residual superblocks from other file systems can cause similar errors.
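Returning to the point above about dirty caches: the following is a minimal sketch of making the backing disk consistent before removing its cache, assuming the bcache device is bcache0. The sysfs attributes follow the upstream bcache documentation; verify the paths on your kernel:

# cat /sys/block/bcache0/bcache/state          # "dirty" means the backing disk alone is not up to date
# echo writethrough > /sys/block/bcache0/bcache/cache_mode    # stop creating new dirty data
# cat /sys/block/bcache0/bcache/dirty_data     # wait until the amount of dirty data drops to 0
# echo 1 > /sys/block/bcache0/bcache/detach    # then detach the cache from the backing device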
Project homepage: http://bcache.evilpiepirate.org/