Bcache: an SSD cache for accelerating Linux block devices


Bcache is a block-layer cache for the Linux kernel. It uses an SSD as a cache in front of hard disks to speed them up: HDDs are cheap and large, SSDs are fast but expensive, and bcache lets you get the benefit of both. Like ZFS's L2ARC, bcache caches reads, but it also adds file-system-independent writeback caching. It is designed to work in any environment at minimal cost and without configuration. By default bcache does not cache sequential I/O, only random reads and writes, and it is suitable for desktops, servers, high-end storage arrays and even embedded systems.

The design goal is for the cached device to be as fast as the SSD itself, across cache hits, cache misses, writethrough and writeback. That goal has not been fully reached yet, sequential writes in particular, but the test results show it is very close, and in some cases, such as random writes, bcache even does better.

Bcache is also meant to be safe for data. For a writeback cache, reliability is critical: an error means data loss. Bcache is intended as an alternative to battery-backed array controllers, so it must stay safe across an unexpected power loss: a write is acknowledged to the upper layers only after all of the data has reached reliable media, and after a power loss a write can never be left partially completed. A great deal of work has gone into this part of the design.

Performance-wise, bcache is designed to keep up with the SSD. It minimizes write amplification and avoids random writes by converting them into sequential writes: writes first land in the SSD cache, which absorbs large bursts, and are later written back to the disk or array sequentially. RAID6 arrays have poor random-write performance, and array controllers with battery protection are expensive; with bcache you can use the excellent software RAID that comes with Linux and still get higher random-write performance on cheaper hardware.
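Both the cache mode and the sequential-I/O cutoff mentioned above can be inspected and changed at run time through bcache's standard sysfs interface. A minimal sketch, assuming the backing device is already registered as /dev/bcache0 (the device name will differ between systems):

# switch the device from the default writethrough mode to writeback
echo writeback > /sys/block/bcache0/bcache/cache_mode
# threshold above which sequential I/O bypasses the cache (defaults to 4 MB)
cat /sys/block/bcache0/bcache/sequential_cutoff
# amount of dirty data waiting to be written back to the backing disk
cat /sys/block/bcache0/bcache/dirty_data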
Features

1. A single cache device can serve as the cache for several backing devices, and caches can be attached to or detached from devices while they are in use.
2. Recovery after an unclean shutdown: a write is not reported as complete until the data has reached stable storage in the cache.
3. Write barriers and cache flushes are handled correctly.

Random write test (7/25/12)

The SSD is an Intel 160 GB MLC drive (an Intel SSDSA2M160), and fio is used for the benchmark; it runs directly against the raw block devices, although the choice of benchmark tool should not make much difference here. The workload is 4 KB random writes through libaio with an I/O depth of 64.
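The job file (~/rw4k) is not reproduced in the source; judging from the parameters visible in the fio output below, it would have looked roughly like this sketch (the direct, size and filename values are assumptions, not the author's original settings):

# ~/rw4k -- reconstructed sketch, not the author's original file
[randwrite]
rw=randwrite
bs=4k
ioengine=libaio
iodepth=64
# the values below are assumptions
direct=1
size=8g
filename=/dev/sdb

The random-write results on the raw SSD device are as follows: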

root@utumno:~# fio ~/rw4k
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio 1.59
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/49885K /s] [0 /12.2K iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=1770
  write: io=8192.3MB, bw=47666KB/s, iops=11916 , runt=175991msec
  cpu          : usr=4.33%, sys=14.28%, ctx=2071968, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/2097215/0, short=0/0/0

Run status group 0 (all jobs):
  WRITE: io=8192.3MB, aggrb=47666KB/s, minb=48810KB/s, maxb=48810KB/s, mint=175991msec, maxt=175991msec

Disk stats (read/write):
  sdb: ios=69/2097888, merge=0/3569, ticks=0/11243992, in_queue=11245600, util=99.99%
With bcache added:
root@utumno:~# fio ~/rw4k
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio 1.59
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/75776K /s] [0 /18.5K iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=1914
  write: io=8192.3MB, bw=83069KB/s, iops=20767 , runt=100987msec
  cpu          : usr=3.17%, sys=13.27%, ctx=456026, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/2097215/0, short=0/0/0

Run status group 0 (all jobs):
  WRITE: io=8192.3MB, aggrb=83068KB/s, minb=85062KB/s, maxb=85062KB/s, mint=100987msec, maxt=100987msec

Disk stats (read/write):
  bcache0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
With bcache the result is 18.5K IOPS, versus 12.2K IOPS on the raw SSD. Bcache comes out ahead because it sends its writes to the SSD sequentially, although it adds the overhead of updating the index. Bcache is optimized for random writes, and it also benefits from the high I/O depth (64): at a high I/O depth, several index updates can be merged into a single write. A high I/O depth corresponds to a heavily loaded system; when the I/O depth is lowered, the IOPS figures change as well.
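The lower-depth runs can be reproduced by giving the whole job on the fio command line and sweeping the queue depth; a sketch (the loop and the option values are mine, not the original test script):

# compare the raw SSD and the bcache device at several queue depths
for dev in /dev/sdb /dev/bcache0; do
    for depth in 1 4 16 64; do
        fio --name=randwrite --filename=$dev --rw=randwrite --bs=4k \
            --ioengine=libaio --iodepth=$depth --direct=1 --size=8g
    done
done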

SSD performance fluctuates at different I/O depths, and different write patterns give different results; what matters here is the relative numbers for the two setups.
With 4 KB random writes at an I/O depth of 1, bcache issues twice as many writes to the SSD as the raw-device test does, because every data write also requires an index update.

Random read

Bcache comes out slightly ahead, which probably depends on the particular data being read; the conclusion is that bcache's random-read performance is essentially the same as the raw SSD's. Note that this test model is actually unfavourable to bcache: it reads back data that was written as 4 KB random writes, so every extent in the btree is only 4 KB and the btree is much larger than it would normally be (in real workloads the average extent size is closer to 100 KB). A larger btree means the index occupies more memory, and more of it sits in second-level index nodes. In my experience this overhead only has a real impact on large machines pushing very high IOPS. If you have other test methods, or see a problem with mine, please let me know by email.
If the system shuts down uncleanly, or the cache device is removed, while the cache still holds dirty data, then the data on the backing disk alone is not consistent. To be able to use the backing disk safely on its own, you must first detach the cache (which flushes the dirty data) or switch the cache to writethrough mode (a sysfs sketch follows below). To use bcache for the root partition, add rootdelay=3 to the kernel boot parameters so that the udev rules can run before the root file system is mounted. If a partition or disk fails to come up as a bcache device at boot, the cause may be a superblock problem: to decide whether to register a device, the udev rule relies on blkid, which inspects the device's superblock to identify the file system type. If the superblock does not identify the device as bcache, the device is not registered.
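A minimal sketch of the two safeguards mentioned above, assuming the device is bcache0 (the device name differs per system; these are the standard bcache sysfs knobs, not commands from the original text):

# stop caching new writes: every write now also goes straight to the backing disk
echo writethrough > /sys/block/bcache0/bcache/cache_mode
# the backing device is safe to use on its own once its state reports "clean"
cat /sys/block/bcache0/bcache/state
# alternatively, detach the cache entirely; dirty data is flushed before detaching
echo 1 > /sys/block/bcache0/bcache/detach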

# cat /usr/lib/udev/rules.d/61-bcache.rules
...
# Backing devices: scan, symlink, register
IMPORT{program}="/sbin/blkid -o udev $tempnode"
# blkid and probe-bcache can disagree, in which case don't register
ENV{ID_FS_TYPE}=="?*", ENV{ID_FS_TYPE}!="bcache", GOTO="bcache_backing_end"
...

# lsblk -o NAME,MAJ:MIN,RM,SIZE,TYPE,FSTYPE,MOUNTPOINT,UUID,PARTUUID
NAME        MAJ:MIN RM   SIZE TYPE FSTYPE MOUNTPOINT UUID                                 PARTUUID
sda           8:0    0 111.8G disk
├─sda1        8:1    0     3G part vfat   /esp       7E67-C0BB                            d39828e8-4880-4c85-9ec0-4255777aa35b
└─sda2        8:2    0 108.8G part ext2              93d22899-cd86-4815-b6d0-d72006201e75 baf812f4-9b80-42c4-b7ac-5ed0ed19be65
sdb           8:16   0 931.5G disk
└─sdb1        8:17   0 931.5G part ntfs              FAD2B75FD2B71EB7                     90c80e9d-f31a-41b4-9d4d-9b02029402b2
sdc           8:32   0   2.7T disk bcache            4bd63488-e1d7-4858-8c70-a35a5ba2c452
└─bcache1   254:1    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdd           8:48   0   2.7T disk bcache            ce6de517-7538-45d6-b8c4-8546f13f76c1
└─bcache0   254:0    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sde           8:64   1  14.9G disk
└─sde1        8:65   1  14.9G part ext4   /          d07321b2-b67d-4daf-8022-f3307b605430 5d0a4d76-115f-4081-91ed-fb09aa2318d
In the example above, /dev/sda2 previously held an ext2 file system. The bcache devices were created with the following command:
# make-bcache -B /dev/sdc /dev/sdd -C /dev/sda2
Because blkid identifies /dev/sdc and /dev/sdd as bcache, they are registered automatically at boot, but /dev/sda2 has to be added manually: the superblock of the previous file system is still present at offset 1024 of /dev/sda2, while the bcache superblock starts at offset 4096, so blkid reports the partition as ext2 and the udev rule skips it. The fix is to overwrite the stale superblock:
# dd if=/dev/zero count=1 bs=1024 seek=1 of=/dev/sda2
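Before the stale superblock has been wiped, or instead of rebooting afterwards, the partition can also be registered by hand through bcache's standard sysfs interface; a sketch:

# tell bcache to probe and register the device for the current boot
echo /dev/sda2 > /sys/fs/bcache/register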
After the system is restarted, all disks are correctly identified:
# lsblk -o NAME,MAJ:MIN,RM,SIZE,TYPE,FSTYPE,MOUNTPOINT,UUID,PARTUUID
NAME        MAJ:MIN RM   SIZE TYPE FSTYPE MOUNTPOINT UUID                                 PARTUUID
sda           8:0    0 111.8G disk
├─sda1        8:1    0     3G part vfat   /esp       7E67-C0BB                            d39828e8-4880-4c85-9ec0-4255777aa35b
└─sda2        8:2    0 108.8G part bcache            93d22899-cd86-4815-b6d0-d72006201e75 baf812f4-9b80-42c4-b7ac-5ed0ed19be65
  ├─bcache0 254:0    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
  └─bcache1 254:1    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdb           8:16   0 931.5G disk
└─sdb1        8:17   0 931.5G part ntfs              FAD2B75FD2B71EB7                     90c80e9d-f31a-41b4-9d4d-9b02029402b2
sdc           8:32   0   2.7T disk bcache            4bd63488-e1d7-4858-8c70-a35a5ba2c452
└─bcache1   254:1    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdd           8:48   0   2.7T disk bcache            ce6de517-7538-45d6-b8c4-8546f13f76c1
└─bcache0   254:0    0   2.7T disk btrfs             2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sde           8:64   1  14.9G disk
└─sde1        8:65   1  14.9G part ext4   /          d07321b2-b67d-4daf-8022-f3307b605430 5d0a4d76-115f-4081-91ed-fb09aa2318dd
Residual superblocks from other file systems can cause similar detection errors.
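One way to avoid this class of problem when repurposing a device is to wipe old file-system signatures before running make-bcache; a sketch using wipefs (my suggestion, not part of the original write-up):

# list any signatures still present on the partition
wipefs /dev/sda2
# erase all detected signatures (destructive; only on a device about to be reformatted)
wipefs -a /dev/sda2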
http://bcache.evilpiepirate.org/
